Meet the Julians

JuliaCon, the annual gathering of the community focused on the Julia programming language, convened last week at MIT. I hung out there for a couple of days, learned a lot, and had a great time. I want to report back with a few words about the Julia language and a few more about the Julia community.

the celebratory cake at the conference, inscribed JuliaCon 2015


A slide from a talk by David Beach, listing traits of various programming languages (C++, Java, Python, R, Matlab, and Fortran)

It’s remarkable that after 70 years of programming language development, we’re still busily exploring all the nooks and crannies of the design space. The last time I looked at the numbers, there were at least 2,500 programming languages, and maybe 8,500. But it seems there’s always room for one more. The slide at left (from a talk by David Beach of Numerica) sums up the case for Julia: We need something easier than C++, faster than Python, freer than Matlab, and newer than Fortran. Needless to say, the consensus at this meeting was that Julia is the answer.

Where does Julia fit in among all those older languages? The surface syntax of Julia code looks vaguely Pythonic, turning its back on the fussy punctuation of the C family. Other telltale traits suggest a heritage in Fortran and Matlab; for example, arrays are indexed starting with 1 rather than 0, and they are stored in column-major order. And there’s a strong suite of tools for working with matrices and other elements of linear algebra, appealing to numericists. Looking a little deeper, some of the most distinctive features of Julia have no strong connection with any of the languages mentioned in Beach’s slide. In particular, Julia relies heavily on generic functions, which came out of the Lisp world. (Roughly speaking, a generic function is a collection of methods, like the methods of an object in Smalltalk, but without the object.)

Perhaps a snippet of code is a better way to describe the language than all these comparisons. Here’s a Fibonacci function:

function fib(n)
    a = b = BigInt(1)
    for i in 1:n
        a, b = b, a+b
    end
    return a
end

Note the syntax of the for loop, which is similar to Python’s for i in range(n):, and very different from C’s for (var i=0; i<n; i++). But Julia dispenses with Python’s colons, instead marking the end of a code block with the keyword end. And indentation is strictly for human readers; it doesn’t determine program meaning, as it does in Python.

For a language that emphasizes matrix operations, maybe this version of the Fibonacci function would be considered more idiomatic:

function fibmat(n)
    a = BigInt[1 1; 1 0]
    return (a^n)[1, 2]
end

What’s happening here, in case it’s not obvious, is that we’re taking the nth power of the matrix \[\begin{bmatrix}1& 1\\1& 0\end{bmatrix}\] and returning the [1, 2] element of the result (the matrix power is symmetric, so the [2, 1] element is the same), which is equal to the nth Fibonacci number. The matrix-power version is 25 times faster than the loop version.
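This works because of the standard identity relating powers of this matrix to Fibonacci numbers:

\[\begin{bmatrix}1& 1\\1& 0\end{bmatrix}^{n} = \begin{bmatrix}F_{n+1}& F_{n}\\F_{n}& F_{n-1}\end{bmatrix}\]

so the off-diagonal entry of the nth power is \(F_{n}\). And if the matrix power is computed by repeated squaring, as one would expect, only about \(\log_{2} n\) matrix multiplications are needed, which helps account for the timing gap below.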

@time fib(10000)
elapsed time: 0.007243102 seconds (4859088 bytes allocated)

@time fibmat(10000)
elapsed time: 0.000265076 seconds (43608 bytes allocated)

Julia’s base language has quite a rich assortment of built-in functions, but there are also 600+ registered packages that extend the language in ways large and small, as well as a package manager to automate their installation. The entire Julia ecosystem is open source and managed through GitHub.

When it comes to programming environments, Julia offers something for everybody. You can use a traditional edit-compile-run cycle; there’s a REPL that runs in a terminal window; and there’s a lightweight IDE called Juno. But my favorite is the IPython/Jupyter notebook interface, which works just as smoothly for Julia as it does for Python. (With a cloud service called JuliaBox, you can run Julia in a browser window without installing anything.)


I’ve been following the Julia movement for a couple of years, but last week’s meeting was my first exposure to the community of Julia developers. Immediate impression: Youth! It’s not just that I was the oldest person in the room; I’m used to that. It’s how much older. Keno Fischer is now an undergrad at Harvard, but he was still in high school when he wrote the Julia REPL. Zachary Yedidia, who demoed an amazing package for physics-based simulations and animations, has not yet finished high school. Several other speakers were grad students. Even the suits in attendance—a couple of hedge fund managers whose firm helped fund the event—were in jeans with shirt tails untucked.

Four of the ringleaders of the Julia movement. From left: Stefan Karpinski, Viral B. Shah, Jeff Bezanson, Keno Fischer.

Second observation: These kids are having fun! They have a project they believe in; they’re zealous and enthusiastic; they’re talented enough to build whatever they want and make it work. And the world is paying attention. Everybody gets to be a superhero.


By now we’re well into the second generation of the free software movement, and although the underlying principles haven’t really changed, the vibe is different. Early on, when GNU was getting started, and then Linux, and projects like OpenOffice, the primary goal was access to source code, so that you could know what a program was doing, fix it if it broke, customize it to meet your needs, and take it with you when you moved to new hardware. Within the open-source community, that much is taken for granted now, but serious hackers want more. The game is not just to control your own copy of a program but to earn influence over the direction of the project as a whole. To put it in GitHub terminology, it’s not enough to be able to clone or fork the repo, and thereby get a private copy; you want the owners of the repo to accept your pull requests, and merge your own work into the main branch of development.

GitHub itself may have a lot to do with the emergence of this mode of collective work. It puts everything out in public—not just the code but also discussions among programmers and a detailed record of who did what. And it provides a simple mechanism for anyone to propose an addition or improvement. Earlier open-source projects tended to put a little more friction into the process of becoming a contributor.

In any case, I am fascinated by the social structure of the communities that form around certain GitHub projects. There’s a delicate balance between collaboration (everybody wants to advance the common cause) and competition (everyone wants to move up the list of contributors, ranked by number of commits to the code base). Maintaining that balance is also a delicate task. The health of the enterprise depends on attracting brilliant and creative people, and persuading them to freely contribute their work. But brilliant creative people bring ideas and agendas of their own.

The kind of exuberance I witnessed at JuliaCon last week can’t last forever. That’s sad, but there’s no helping it. One reason we have those 2,500 (or 8,500) programming languages is that it’s a lot more fun to invent a new one than it is to maintain a mature one. Julia is still six tenths of a version number short of 1.0, with lots of new territory to explore; I plan to enjoy it while I can.


Quick notes on a few of the talks at the conference.

Zenna Tavares discusses probabilistic programming with Sigma.jl at JuliaCon 2015

Zenna Tavares described Sigma.jl, a Julia package for probabilistic programming—another hot topic I’m trying to catch up with. Probabilistic programming languages try to automate the process of statistical modeling and inference, which means they need to build things like Markov chain Monte Carlo solvers into the infrastructure of the programming language. Tavares’s language also has a SAT solver built in.

Chiyuan Zhang gave us Mocha.jl, a deep-learning/neural-network package inspired by the C++ framework Caffe. Watching the demo, I had the feeling I might actually be able to set up my own multilayer neural network on my kitchen table, but I haven’t put that feeling to the test yet.

Finally, because of parallel sessions I missed the first half of Pontus Stenetorp’s talk on “Suitably Naming a Child with Multiple Nationalities using Julia.” I got there just in time for the big unveiling. I was sure the chosen name would turn out to be “Julia.” But it turns out the top three names for the offspring of Swedish and Japanese parents are:

Best Swedish-Japanese names, according to Pontus Stenetorp’s Julia program

Stenetorp wants to extend his algorithm to more language pairs. And he also needs to tell his spouse about the results of this work.

Posted in computing | 3 Comments

Fast money, slow data

The other day I bought lunch from a food truck on the Harvard campus, paying with a debit card that my server swiped through one of those little plastic doodads attached to an iPhone. Ten minutes later, while I ate my bánh mì in a shady corner of Harvard Yard, I opened my laptop and logged on to my bank’s website. The $7 transaction had already been posted to my account.

So here’s a question to think about: If the credit card networks and the banks can track the movement of money minute by minute, why does it take months to calculate the overall level of economic activity in the U.S.?

The Department of Commerce reports on gross domestic product (GDP) at quarterly intervals. This number (or its first derivative) is often cited as a kind of benchmark of economic wellbeing. The initial announcement comes a month after the close of each quarter; for Q1 2015, the news was released April 29. That news wasn’t cheering: a measly 0.2 percent rate of growth. The New York Times coverage began, “Repeating an all-too-familiar pattern, the American economy slowed to a crawl in the first quarter of 2015….”

But that wasn’t the last word on the first quarter. At the end of May, a revised report came out, saying the winter months had been even bleaker than first thought: not +0.2 percent but –0.7 percent. Then yesterday a third and “final” estimate was released (PDF). It was a Goldilocks number, –0.2 percent, pretty near the mean of the two earlier figures. Said the Times: “While nothing to brag about, the economy’s performance in early 2015 was not quite as bad as the number-crunchers in Washington had thought.”

If we take it for granted that the end-of-June GDP estimate is somewhere near correct, then the first two reports were worse than useless; they were misleading. Taking action on the basis of those numbers—making an investment, say, or setting an interest rate—would have been foolish. It seems that if you want to know how the economy is doing in the first quarter, you have to wait until the second quarter ends. And of course we’re about to repeat the cycle. How did business fare this spring? Check back at the end of September, when my $7 food-truck sandwich may finally register in the government’s books.


Why so slow? To answer this, I thought I ought to learn a little something about how the Department of Commerce calculates GDP. Here’s where I learned that little something:

Landefeld, J. Steven, Eugene P. Seskin, and Barbara M. Fraumeni. 2008. Taking the pulse of the economy: Measuring GDP. Journal of Economic Perspectives 22(2):193–216. PDF.

It’s worse than I had guessed. The baseline for all these “national accounts estimates” is an economic census conducted every five years (in years congruent to 2 mod 5). For the 19 quarters between one economic census and the next, numbers are filled in by extrapolation, guided by a miscellany of other data sources. Landefeld et al. write:

The challenge lies in developing a framework and methods that take these economic census data and combine them using a mosaic of monthly, quarterly, and annual economic indicators to produce quarterly and annual GDP estimates. For example, one problem is that the other economic indicators that are used to extrapolate GDP in between the five-year economic census data—such as retail sales, housing starts, and manufacturers shipments of capital goods—are often collected for purposes other than estimating GDP and may embody definitions that differ from those used in the national accounts. Another problem is some data are simply not available for the earlier estimates. For the initial monthly estimates of quarterly GDP, data on about 25 percent of GDP— especially in the service sector—are not available, and so these sectors of the economy are estimated based on past trends and whatever related data are available. For example, estimates of consumer spending for electricity and gas are extrapolated using past heating and cooling degree data and the actual temperatures, while spending for medical care, education, and welfare services are extrapolated using employment, hours, and earnings data for these sectors from the Bureau of Labor Statistics.

Is this methodology really the best way to measure the state of the economy in the age of Big Data?

Actually, there’s a lot to be said for the quinquennial economic census. It goes to a large sample (four million businesses), and the companies are compelled by law to respond, which mitigates selection bias. The data series goes back to 1934; maintaining continuity with past measurements is valuable. Furthermore, the census and other survey-based instruments probe much more than just transactional data. They try to quantify what a company manufactures (or what services it provides), what labor and material inputs are consumed, and where the company’s revenue comes from. The analysis includes not just income and expenditures but also depreciation and amortization and other scary abstractions from the world of accounting. You can’t get all that detail just by following the money as it sloshes around the banking system.

Still, can we afford to wait five years between surveys? Three months before we get a reliable (?) guess about what happened in the previous quarter? Consider the predicament of the Federal Reserve Board, trying to walk the narrow line between encouraging economic growth and managing inflation. This is a problem in control theory, where delayed feedbacks lead to disastrous instabilities. (Presumably Janet Yellen and her colleagues have access to more timely data than I do. I hope so.)

Could we really create an up-to-the-minute measure of national economic health by mining credit card data, bank accounts, supermarket inventories, and food-truck receipts? I really don’t know. But I’ll quote a 2014 Science review by Liran Einav and Jonathan Levin:

Whereas the popular press has focused on the vast amount of information collected by Internet companies such as Google, Amazon, and Facebook, firms in every sector of the economy now routinely collect and aggregate data on their customers and their internal businesses. Banks, credit card companies, and insurers collect detailed data on household and business financial interactions. Retailers such as Walmart and Target collect data on consumer spending, wholesale prices, and inventories. Private companies that specialize in data aggregation, such as credit bureaus or marketing companies such as Acxiom, are assembling rich individual-level data on virtually every household….

One potential application of private sector data is to create statistics on aggregate economic activity that can be used to track the economy or as inputs to other research. Already the payroll service company ADP publishes monthly employment statistics in advance of the Bureau of Labor Statistics, MasterCard makes available retail sales numbers, and Zillow generates house price indices at the county level. These data may be less definitive than the eventual government statistics, but in principle they can be provided faster and perhaps at a more granular level, making them useful complements to traditional economic statistics.

I suppose I should also mention worries about giving government agencies access to so much personal financial data. But that horse is out of the barn.

Posted in social science, statistics | 3 Comments

Traffic Jams in Javascript

Building bigger roads can make traffic worse. The usual explanation for this counterintuitive and counterproductive outcome is that bigger roads attract bigger shopping malls, which in turn attract more cars. But that’s not the whole story. In the 1960s Dietrich Braess discovered a theoretical configuration of roadways where building a new connecting road can make everyone’s trip slower even when the number of vehicles is held constant. Conversely, closing a road in the Braess network can allow everyone to get home quicker. The behavior is weird enough to justify the term “Braess’s paradox.”

A few years ago Joel Cohen suggested to me that Braess’s paradox might make a good topic for my “Computing Science” column. I hesitated. There were already plenty of published discussions of the paradox, including excellent ones by Cohen himself, as well as a book by Tim Roughgarden (which I had reviewed for American Scientist). I didn’t think I had anything more to add to the conversation.

Lately, though, I began to consider the problem of visualizing Braess’s paradox—presenting it in such a way that you can watch individual cars navigating through the road network, rather than merely calculating average speeds and travel times. Having an opportunity to play with the model—fiddling with the knobs and switches, trying different routing algorithms—might lead to a clearer understanding of why well-informed and self-interested drivers would choose a route that winds up making everybody late for dinner.

I now have a working JavaScript model of something resembling Braess’s paradox, which I urge you to go take for a test drive. There’s an accompanying “Computing Science” column in the latest issue of American Scientist, available on the magazine website in HTML or PDF. If you’re interested in the source code, it’s on Github. Here I’m going to say a few words about the implementation of the model and what I’ve learned from it.


Adapting Braess’s mathematical model to a more mechanistic and visual medium was trickier than I had expected. The original formulation is quite abstract and not very physical; it’s closer to graph theory than to highway engineering. In the diagrams below the wide, blue roads labeled A and B never become congested; no matter how much traffic they carry, the transit time for these links is always one hour. The narrow red roads a and b offer a transit time of zero when they’re empty, but traffic slows as the load increases; if everyone crowds onto a single red link, the transit time is again one hour. As for the gold route X, this magical thoroughfare offers instantaneous transport for any number of vehicles.

diagrams of the Braess diamond road network

The presence or absence of the X road is what triggers the paradox. Without the golden crosslink (left diagram), traffic splits equally between the Ab and aB routes, and all cars take 90 minutes to complete the trip. When the X link is opened (right diagram) all drivers favor route aXb, and everyone spends a full two hours on the road.

The essential supposition behind the paradox is that everyone follows a selfish routing policy, choosing whichever route offers the quickest trip, and ignoring all factors other than travel time. And, ironically, it’s the insistence that no one else may have a shorter trip that leaves all the drivers in a bumper-to-bumper jam on route aXb, while AXB sits empty. Why? If any driver decided to defect to AXB, the departure would marginally lessen the load on aXb, giving the latter route a slight speed advantage. Under the selfish policy, the defecting driver would then have to return to aXb. It’s a stalemate.
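To make the arithmetic concrete, here is a minimal sketch in Python. It assumes, as in the description above, that the blue links always cost one hour, that X is free, and that a red link’s delay grows linearly from zero (empty) to one hour (carrying all the traffic), which reproduces the 90-minute figure:

def trip_time(route, load, total):
    # Hours to traverse a route, given the load on each congestible link.
    # Assumptions: A and B always take 1 hour, X takes 0, and a red link
    # carrying x of the total vehicles takes x/total hours.
    cost = {'A': 1.0, 'B': 1.0, 'X': 0.0,
            'a': load['a'] / total, 'b': load['b'] / total}
    return sum(cost[link] for link in route)

total = 4000.0

# Crosslink closed: traffic splits evenly between routes Ab and aB.
split = {'a': total / 2, 'b': total / 2}
print(trip_time('Ab', split, total), trip_time('aB', split, total))  # 1.5 1.5

# Crosslink open: every selfish driver ends up on route aXb.
jam = {'a': total, 'b': total}
print(trip_time('aXb', jam, total))  # 2.0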


Cars moving at infinite speed and roads with infinite capacity may be all right in a mathematical model, but they are troublesome in a simulation that’s supposed to look something like real highway traffic. Seeking a more realistic model, I settled on the road layout shown below, inspired by a diagram in a 1997 paper by Claude M. Penchina (Braess Paradox: maximum penalty in a minimal critical network. Transportation Research A 31(5):379–388).

Road layout of the JavaScript traffic simulation.

The topology is the same as that of Braess’s original network, but the geometry is different, and so is the relation between congestion and speed. The aim is still to get from start to end—or, rather, from Origin to Destination. The A and B road segments are again wide and insensitive to traffic congestion. The a and b roads are straighter and shorter but also narrower. With zero traffic, vehicle speed is the same on a and b as on A and B, but as traffic gets heavier, speed falls off. The analogue of the “golden road” is a short bridge at the center of the map, which has the same speed properties as A and B. In the initial configuration, the bridge is blocked off, but a mouseclick opens it. In the snapshot of the running model shown above, the bridge is open and carrying traffic.

Cars, represented by colored dots, enter the system at the Origin. At the moment of launching, each car chooses one of the available routes. If the bridge is closed, there are just two possibilities: Ab and aB. With the bridge open, drivers also have the option of the ab shortcut or the longer AB. The cars then advance along the selected roadways, governed by the speed restrictions on each segment, until they reach the Destination.

This scheme differs in several important ways from the original Braess formulation. Does it still exhibit the paradox? In other words, does the travel time increase when the bridge is opened, allowing traffic to flow over routes ab and AB? The answer is yes, for a wide range of parameter values, as shown in these outputs:

Braess paradox timings

The tables show the number of vehicles choosing each route and the average time they spend on the road. (The times are measured in units of the quickest possible trip: taking the shortest route ab with zero traffic.) Note that opening up the bridge slowed down travel on all four routes. Even though the Ab and aB routes carried about 37 percent less traffic when the bridge was open, cars on those routes took 9 to 15 percent longer to complete their journeys. The ab and AB routes were even slower.


But the numbers don’t tell the whole story—that was the first thing I learned when I got the simulation running. In the bridge-closed state, where the total traffic splits into two roughly equal streams, you might imagine successive cars alternating between Ab and aB, so that the system would reach a statistical steady state with equal numbers of cars on each of the two routes at any given moment. That’s not at all what happens! The best way to see what does happen is to go run the simulation, but the graph below also gives a clue.

number of cars on routes Ab and aB as a function of time

Instead of settling into a steady state, the system oscillates with a period of a little less than 500 time steps, which is roughly half the time it takes a typical car to make a complete trip from Origin to Destination. The two curves are almost exactly 180 degrees out of phase.

It’s not hard to understand where those oscillations come from. As each car enters the road network, it chooses the route with the shortest expected travel time, based on conditions at that moment. The main determinant of expected travel time is the number of cars occupying the a and b segments, where speed decreases as the roads get more crowded. But when cars choose the less-popular route, they also raise the occupancy of that route, making it appear less favorable to the cars behind them. Furthermore, on the Ab route there is a substantial delay before the cars reach the congestion-sensitive segment. The delay and the asymmetry of the network create an instability—a feedback loop in which overshooting and overcorrection are inevitable.

When the connecting bridge is opened, the pattern is more complicated, but oscillations are still very much in evidence:

Four route car census

There seem to be two main alternating phases—one where Ab and ab dominate (the Christmas phase, in this color scheme), and the other where ab and AB take over (the Cub Scout phase). The period of the waves is less regular but mostly longer.

I am not the first to observe these oscillations; they are mentioned, for example, by Daniele Buscema and colleagues in a paper describing a NetLogo simulation of Braess-like road networks. Overall, though, the oscillations and instabilities seem to have gotten little attention in the literature.

The asymmetry of the layout is crucial to generating both the oscillations and the paradoxical slowdown on opening the central connecting link. You can see this for yourself by running a symmetric version of the model. It’s quite dull.


One more bug/feature of the dynamic model is worth a comment. In the original Braess network, the A and B links have unlimited capacity; in effect, the model promises that the time for traversing those roads is a constant, regardless of traffic. In a dynamic model with discrete vehicles of greater-than-zero size, that promise is hard to keep. Consider cars following the Ab route, and suppose the b segment is totally jammed. At the junction where A disgorges onto b, cars have nowhere to go, and so they spill back onto the A segment, which can therefore no longer guarantee constant speed.

In implementing the dynamic model, I discovered there were a lot of choices to be made where I could find little guidance in the mathematical formulation of the original Braess system. The “queue spillback” problem was one of them. Do you let the cars pile up on the roadway, or do you provide invisible buffers of some kind where they can quietly wait their turn to continue their journey? What about cars that present themselves at the origin node at a moment when there’s no room for them on the roadways? Do you queue them up, do you throw them away, do you allow them to block cars headed for another road? Another subtle question concerns priorities and fairness. The two nodes near the middle of the road layout each have two inputs and two outputs. If there are cars lined up on both input queues waiting to move through the node, who goes first? If you’re not careful in choosing a strategy to deal with such contention, one route can be permanently blockaded by another.

You can see which choices I made by reading the JavaScript source code. I won’t try to argue that my answers are the right ones. What’s probably more important is that after a lot of experimenting and exploring alternatives, I found that most of the details don’t matter much. The Braess effect seems to be fairly robust, appearing in many versions of the model with slightly different assumptions and algorithms. That robustness suggests we might want to search more carefully for evidence of paradoxical traffic patterns out on real highways.

Posted in computing, mathematics, modern life | 12 Comments

Whales of the Web

The average web site has links connecting it with 29 other sites. I came up with this number in the following way. A data set I’ve been playing with for a few weeks lists 43 million web sites and 623 million links between sites; the quotient of those numbers is about 14.5. Since each link has two ends, the per-site total of inbound plus outbound links is double the quotient.
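In symbols, the calculation is just:

\[2 \times \frac{623 \times 10^{6}}{43 \times 10^{6}} \approx 2 \times 14.5 = 29.\]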

Twenty-nine links is the average, but by no means is it a typical number of links. Almost half of the sites have four or fewer links. At the other end of the spectrum, the most-connected web site (blogspot.com) has almost five million links, and there are six more sites with at least a million each. The distribution of link numbers—the degree sequence—looks like this (both scales are logarithmic, base 2):

Degree sequence of the WWW

I want to emphasize that these are figures for web sites, not web pages. The unit of aggregation is the “pay-level domain”—the domain name you have to pay to register. Examples are google.com or bbc.co.uk. Subdomains, such as maps.google.com, are all consolidated under the main google.com entry. Any number of links from pages on site A to pages on site B are recorded as a single link from A to B.

The source of these numbers is the Web Data Commons, a project overseen by a group at the University of Mannheim. They extracted the lists of domains and the links between them from a 2012 data set compiled and published by the Common Crawl Foundation (which happens to be the subject of my latest American Scientist column). The Common Crawl does essentially the same thing as the big search engines—download the whole Web, or some substantial fraction of it—but the Common Crawl makes the raw data publicly available.

There are interesting questions about both ends of the degree sequence plotted above. At the far left, why are there so many millions of lonely, disconnected web sites, with just one or two links, or none at all? I don’t yet feel I know enough to tell the story of those orphans of the World Wide Web. I’ve been focused instead on the far right of the graph, on the whales of the Web, the handful of sites with links to or from many thousands of other sites.

From the set of 43 million sites, I extracted all those with at least 100,000 inbound or outbound links; in other words, the criterion for inclusion in my sample was \(\max(indegree, outdegree) \ge 100,000\). It turns out that just 112 sites qualify. In the diagram below, they are grouped according to their top-level domain (com, org, de, and so on). The size of the colored dot associated with each site encodes the total number of links; the color indicates the percent of those links that are incoming. Hover over a site name to see the inbound, outbound and bidirectional links between that site and the other members of this elite 112. (The diagram was built with Mike Bostock’s d3.js framework, drawing heavily on this example.)

Interactive diagram of the 112 most-connected web sites, grouped by top-level domain

The bright red dots signify a preponderance of outgoing links, with relatively few incoming ones. Many of these sites are directories or catalogs, with lists of links classified by subject matter. Such “portal sites” were popular in the early years of the Web, starting with the World Wide Web Home at CERN, circa 1994; another early example was Jerry and David’s Guide to the World Wide Web, which evolved into Yahoo. Search engines have swept aside many of those hand-curated catalogs, but there are still almost two dozen of them in this data set. Curiously, the Netherlands and Germany (nl and de) seem to be especially partial to hierarchical directories.

Bright blue dots are rarer than red ones; it’s easier to build a site with 100,000 outbound links than it is to persuade 100,000 other sites to link to yours. The biggest blue dot is for wordpress.org, and I know the secret of that site’s popularity. If you have a self-hosted WordPress blog (like this one), the software comes with a built-in link back to home base.

Another conspicuous blue dot is gmpg.org, which mystified me when I first noticed that it ranks fourth among all sites in number of incoming links. Having poked around at the site, I can now explain. GMPG is the Global Multimedia Protocols Group, a name borrowed from the Neal Stephenson novel Snow Crash. In 2003, three friends created a real-world version of GMPG as a vehicle for the XHTML Friends Network, which was conceived as a nonproprietary social network. One of the founders was Matt Mullenweg, who was also the principal developer of WordPress. Hence every copy of WordPress includes a link to gmpg.org. (The link is in the <head> section of the HTML file, so you won’t see it on the screen.) At this point GMPG looks to be a moribund organization, but nonetheless more than a million web sites have links to it.

Networkadvertising.org is the web site of a trade group for online advertisers. Presumably, its 143,863 inbound links are embedded in ads, probably in connection with the association’s opt-out program for behavioral tracking. (To opt out, you have to accept a third-party cookie, which most people concerned about privacy would refuse to do.)

Still another blue-dot site, miibeian.gov.cn, gets its inward links in another way. If I understand correctly, all web sites hosted in China are required to register at miibeian.gov.cn, and they must place a link back to that site on the front page. (If this account is correct, the number of inbound links to miibeian.gov.cn tells us the number of authorized web sites in China. The number in the 2012 data is 289,605, which seems low.)

One final observation I find mildly surprising: Measured by connectivity, these 112 sites are the largest on the entire Web, and you might think they would be reasonably stable over time. But in the three years since the data were collected, 10 percent of the sites have disappeared altogether: Attempts to reach them either time out or return a connection error. At least a few more sites have radically changed their character. For example, serebella.com was a directory site that had almost 700,000 outbound links in 2012; it is now a domain name for sale. Among web sites, it seems, nobody is too big to fail.

The table below lays out the numbers for the 112 sites. It’s sortable: Click on any of the column headers to sort on that field; click again to reverse the ordering. If you’d like to play with the data yourself, download the JSON file.

Sortable table of the 112 sites, with columns for site, inlinks, outlinks, total links, and % inbound




Posted in computing | 7 Comments

A taxing algorithm

My first encounter with the term algorithm did not come from Don Knuth. I learned it from the Internal Revenue Service. Sometime in the late 1960s or early 70s the IRS introduced a redesigned Form 1040 based on a principle borrowed from the world of computing: the algorithm. What this meant, I soon discovered, was that the wording of the instructions would be procedural rather than declarative. Instead of “Your tax is 15 percent of the amount on line 42,” we had “Multiply the amount on line 42 by 0.15 and write the result on line 43.” I had expected something more revolutionary, but at least I expanded my vocabulary.

I’ve filled out a lot of 1040s since then, but until yesterday I had never become acquainted with Schedule D (Capital Gains and Losses). What a treat I had waiting for me! Tucked inside the Schedule D instruction book, I found a marvel of labyrinthine arithmetic and logic. The tax worksheet on pages D-15 and D-16 might well be the most complex algorithm that ordinary people are ever asked to perform.

Below is my attempt to represent the worksheet as a data-flow diagram. Inputs (mostly dollar amounts copied from elsewhere in the tax return) are in blue; the eventual output (tax owed) is in red; the flow is mostly from bottom to top. The numbers in the boxes correspond to line numbers in the worksheet.

data-flow diagram for IRS Schedule D tax worksheet

The directed graph has 82 nodes and 96 edges. Twelve subtractions, seven additions, four multiplications, ten comparisons, and two table lookups. Now that’s an algorithm! It’s gnarlier than calculating the date of Easter.

What are the chances that I correctly followed all the steps of this algorithm? What are the chances that the published algorithm correctly implements the intent of the tax code?

Posted in computing, modern life | 12 Comments

Mrs. Nguyen’s prestidigitation

From a set of 1 through 9 playing cards, I draw five cards and get cards showing 8, 4, 2, 7, and 5. I ask my 6th graders to make a 3-digit number and a 2-digit number that would yield the greatest product. I add, “But do not complete the multiplication — meaning do not figure out the answer. I just want you to think about place value and multiplication.”

box model of 3 x 2 digit multiplication

The problem above comes from Fawn Nguyen, who teaches math at a public school in southern California and writes a blog called Finding Ways. To her students she is “Mrs. Win,” an English approximation to the Vietnamese pronunciation.

It’s clear that much care and craftsmanship went into formulating this problem. Why did Nguyen choose a two-digit-by-three-digit product, rather than two-by-two or three-by-three? The asymmetry ensures a unique solution. Why did she use playing cards to select the digits, rather than simply asking the kids to call out numbers? The cards produce a set of distinct digits, without duplicates, and they also rule out the possibility of choosing zero. Repeated digits or zeros would not ruin the problem, but they would needlessly complicate it. Nguyen’s classroom procedure eliminates these distractions without even mentioning them. This is the kind of subtle indirection you find in the performance of a first-rate stage magician.

I’ve had some serious fun with Nguyen’s problem. Finding the answer is not difficult—especially if you cheat and set aside her boldfaced injunction forbidding arithmetic. But of course the answer itself isn’t the point, as Nguyen makes clear in her blog post describing the students’ search for a solution. What we really care about is why one particular arrangement of those digits yields a larger product than all other arrangements. We want an explanatory pattern or principle, a rule that works not just for this one set of digits but for any instance of the max-product problem. I had many stumbles on the path to finding such a rule. Here are some notes I made along the way. But before reading on you might want to write down your own best guess at the answer.


  1. First, some notation. Let’s label the digits like so:

    \[\begin{align}
    x_{2}\; x_{1}\; x_{0}&{}\\
    \underline{\times \quad y_{1} \; y_{0}}&{},
    \end{align}\]

    where the subscripts encode the power of 10 associated with each digit. The object is to find a one-to-one mapping between the sets \(\{x_{2}, x_{1}, x_{0}, y_{1}, y_{0}\}\) and \(\{8, 4, 2, 7, 5\}\) that maximizes the product.

  2. Nguyen’s sixth graders were quick to perceive one important principle: Larger digits should generally go to the left, and smaller digits to the right, in order to get the most benefit from place values in decimal notation. No one proposed 258 × 47 as the maximum product; that combination is obviously beaten by 852 × 74.
  3. If we wanted to solve this problem by exhaustive search, how exhausting would the search be? There are \(5! = 120\) permutations of the five digits, and each permutation corresponds to a different multiplication problem. But in view of the observation made in item 2, the number of plausible candidates is much smaller: just \(\binom{5}{2} = \binom{5}{3} = 10\). You could check them all in a matter of minutes. (But Mrs. Nguyen would not approve.)
  4. Again following the logic of item 2, we can stipulate that the 8—the largest of the five digits—must wind up either in position \(x_2\) or in position \(y_1\). The question is which. My first thought was that surely the 8 goes in the three-digit number, because position \(x_2\) gives the 8 a value of 800, whereas it’s only worth 80 as \(y_1\).
  5. My second thought undermined my first thought. Suppose the problem had been formulated without Mrs. Nguyen’s clever card trick, and we had to work with the digits 8, 7, 0, 0, 0. Then the arrangement yielding the maximum product is surely either 800 × 70 or 700 × 80. But of course those two products are both equal to 56,000. In other words, if we consider just the leading digits, it doesn’t matter which goes in the multiplier and which in the multiplicand.
  6. Let’s write out the full six-term polynomial defining the product: \[1000 x_{2}y_{1} + 100 x_{1}y_{1} + 10 x_{0}y_{1} + 100 x_{2}y_{0} + 10 x_{1}y_{0} + x_{0}y_{0}\] The leading term provides another way of understanding why the largest digit can be either \(x_2\) or \(y_1\). It’s just the commutative law. We get \(1000 x_{2}y_{1}\) with either ordering. Thus we need to consider the smaller terms to decide which placement is better. Hint: \(y_1\) appears in three terms but \(x_2\) turns up in only two.
  7. Maybe it would help to look at a smaller version of the problem? For example, maximize the product \(x_{1}\, x_{0} \times y_{0}\) with digits drawn from the set \(\{1, 2, 3\}\). Where does the largest digit belong?
  8. Here’s a vague, daydreamy notion: The task of maximizing the product of these numbers looks something like an isoperimetric problem. If you have a rectangle of perimeter \(2(h + w)\), you maximize the area \(hw\) by making \(h=w\), so that the rectangle becomes a square. Maybe, by analogy, we should make the three-digit and the two-digit numbers as close to equal as possible, while at the same time making both of them as large as possible.
  9. At this point I was ready to commit. I was persuaded by the arguments in items 6, 7, and 8 that the largest digit should be bound to variable \(y_1\), and the next largest to \(x_2\). Then the same rule could be applied recursively to the remaining digits and variables, yielding this assignment:

    \[\begin{align}
    7\, 4\, 2&{}\\
    \underline{\times \quad 8 \, 5}&{}\\
    6\, 3\, 0\, 7\, 0&{}
    \end{align}\]

  10. Of course you already know I was wrong. Even if you haven’t yet solved the problem for yourself or looked up the answer elsewhere, it’s a well-established principle of proper storytelling that no one ever stumbles on the right answer on the first try.

    My error was revealed by a program running an exhaustive-search algorithm. It showed that exchanging the positions of the 7 and the 8 yields a larger product:

    \[\begin{align}
    8\, 4\, 2&{}\\
    \underline{\times \quad 7 \, 5}&{}\\
    6\, 3\, 1\, 5\, 0&{}
    \end{align}\]

    But that isn’t the right answer either. Instead of switching the 7 and 8, you can exchange the 5 and the 4 to get the following result, which does turn out to be optimal:

    \[\begin{align}
    7\, 5\, 2&{}\\
    \underline{\times \quad 8 \, 4}&{}\\
    6\, 3\, 1\, 6\, 8&{}
    \end{align}\]


So that’s it—the answer we’ve been seeking, the maximum product, the solution to Mrs. Nguyen’s problem. There’s no arguing with arithmetic.
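If you’d like to check this by brute force, a few lines of Python will do. Here is one minimal version of the exhaustive search (an illustration, not a recommendation to disobey Mrs. Nguyen):

from itertools import permutations

def max_product(digits=(8, 4, 2, 7, 5)):
    # Try every 3-digit-by-2-digit arrangement of the five digits and
    # return (product, three_digit_factor, two_digit_factor) for the best.
    best = (0, 0, 0)
    for p in permutations(digits):
        x = 100 * p[0] + 10 * p[1] + p[2]   # three-digit factor
        y = 10 * p[3] + p[4]                # two-digit factor
        best = max(best, (x * y, x, y))
    return best

print(max_product())   # -> (63168, 752, 84)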

But it’s hardly the end of the trail. Why does that peculiar permutation of the digit set give the biggest product? Does the same pattern work for other two-digit-by-three-digit products? What about problems of other shapes and sizes? And what is the pattern, anyway? How would you succinctly state the rule for placing digits in Mrs. Nguyen’s boxes?

In trying to make sense of what’s going on here, I’ve found a certain graphic device to be helpful. I call it a lacing diagram, because it reminds me of the various schemes for lacing up a pair of sneakers. The patterns are easier to perceive with larger numbers of digits (i.e., high-top sneakers), so the examples given below show sets of 10 digits arranged to form a five-by-five multiplication problem.

In a lacing diagram the red arrows trace a path that threads its way through all the digits, ordered from largest to smallest. Each segment of this path can either cross from one factor to the other (a trans step) or stay within the same factor (a cis step). The particular lacing shown here is made up of alternating trans and cis segments. The sequence of t’s and c’s below the diagram serves as a linearized encoding of the pattern; the number below the encoding is the product of the two factors.

lacing pattern tctctctct

Here is the lacing diagram corresponding to the pattern that I initially believed would prevail in all Nguyen products:

lacing pattern ttttttttt

In this case, all the steps are trans, as the arrows zigzag from one side to the other. The linear encoding consists of nine t’s.

And finally here is the lacing diagram for the winning permutation, the arrangement that gives the largest product. The pattern is the same as the second lacing diagram above—the zigzag—except for a single twist: The leftmost digits of the two factors have been switched, and so there is one cis step in the path.

lacing pattern tcttttttt

After a bit of puzzling, I was able to work out why that single twist yields a better score than the plain zigzag. It’s because \((9 \times 7) + (8 \times 6) = 111\), whereas \((9 \times 6) + (8 \times 7) = 110\). Likewise \((9 \times 5) + (8 \times 4) = 77\), whereas \((9 \times 4) + (8 \times 5) = 76\), and so on. Note the difference between the twisted and the zigzag products: \(8439739020 - 8428629020 = 11110000\). Each of the four pairings of the leftmost digits with those to their right contributes a 1 in that difference.

If one twist is a good thing, maybe more twists would be even better? For example, if we invert the 5 and the 4, we get \((5 \times 3) + (4 \times 2) = 23\) instead of \((5 \times 2) + (4 \times 3) = 22\), again for a net gain. But of course there’s a flaw in this strategy. Flipping the 5 and the 4 increases their products with neighboring digits to the right, but lowers those with digits to the left. Flipping the leftmost digits doesn’t run afoul of this rule because there are no neighbors to the left.

For a more explicit definition of the zigzag-with-a-twist arrangement, here is a Python implementation. The basic idea is to deal out the digits alternately to the \(x\) and \(y\) factors, starting with \(x\) and working through the digits in descending order. When \(y\) (the smaller factor) runs out of room, any remaining digits are tacked onto the end of \(x\). Finally—this is the twist—the leftmost \(x\) and \(y\) digits are swapped. This procedure works in any base.

def twisted_zigzag(digit_set, s):
    s = min(s, len(digit_set) - s)    # len of smaller factor
    digits = sorted(list(digit_set))  # smallest to largest
    x = []
    y = []
    while digits:                     # pop from end of list
        x.append(digits.pop())        # start with x
        if len(y) < s:                # zigzag until y is full
            y.append(digits.pop())
    x[0], y[0] = y[0], x[0]           # swap first digits
    return x, y
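For example, applied to Mrs. Nguyen’s five digits (the second argument is the intended length of one of the factors):

print(twisted_zigzag({8, 4, 2, 7, 5}, 2))   # -> ([7, 5, 2], [8, 4]), i.e. 752 × 84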

Does the zigzag-with-a-twist arrangement give the maximum product for all possible Nguyen-type problems? I can offer substantial evidence supporting that proposition. For base-10 numbers formed without repeated digits there are 4,097 two-factor multiplication problems with digits sorted in descending order. The zigzag-twist pattern gives the correct result for all of them. (For digit sets that include 0, the solution is not always unique.) Indeed, the algorithm works for all problems in bases between 2 and 15. It’s hardly a daring conjecture to suggest that it works in all bases, but I have not produced a proof. Maybe Mrs. Nguyen’s sixth graders will do so.

Posted in mathematics | 7 Comments

A quasirandom talk

I’ll be giving a talk at Harvard this Friday, February 6: “Orderly Randomness: Quasirandom Numbers and Quasi–Monte Carlo.” If you’re in the neighborhood, please stop by. Lunch at 12:30; quasirandomness at 1:00 pm.

Update 2015-02-08: Slides online. And video.

Posted in computing, mathematics | 1 Comment

600613

Pick a number, N, then try searching for it on the web via Bing or Google (or maybe the leet version of Google). What can you expect to learn? I wasn’t quite sure of the answer, so I ran some experiments.

When N is a small positive integer—less than 100, say—the leading results tend to be mass-audience web pages that happen to display the numeral N in some prominent way, such as in a headline or a title. There are news stories (Packers 43, Falcons 37), TV stations (WXMI Fox 17), a few brand names (Motel 6), references to iconic events (9/11, Apollo 13), listings of Bible verses (Romans 3:23).

With somewhat larger integers—three or four digits—I see a lot of street addresses, area codes, tax forms, statutes and ordinances. With five-digit numbers, Zip codes become prominent. At six digits we enter the land of hex colors, accompanied by a baffling variety of part numbers, account numbers, serial numbers, patent numbers, error numbers, lottery numbers. With a search string of 8 to 10 digits, telephone directories dominate the results. Still further out on the number line, you eventually come to a numerical desert where Google and Bing usually come up empty.


To get a more quantitative sense of how numbers are distributed across the web, I decided to do some sampling. I randomly selected 2,000 positive integers of 1 to 12 decimal digits, and submitted them to Google as web search queries. To construct the query integers I started with 2,000 floating-point numbers selected uniformly at random (with replacement) from the range \(0 \le m \lt 12\). For each \(m\) I calculated \(N = \lfloor 10^{m}\rfloor\), then ran a Google search for N. The work was done by a Python script with a politeness pause of one second between queries. From the results of each search I extracted \(H(N)\), the number of hits, which Google reports near the top of the page. Here’s what I found, plotted on log-log scales:

Google hits 12 digit
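(For reference, the integer-selection step described above can be sketched in a few lines of Python. This is an illustration of the recipe, with made-up names, not the actual survey script.)

import random

def sample_query_integers(count=2000, max_digits=12, seed=None):
    # Pick m uniformly in [0, max_digits), then take N = floor(10**m),
    # so the number of digits in N is roughly uniform from 1 to 12.
    rng = random.Random(seed)
    return [int(10 ** rng.uniform(0, max_digits)) for _ in range(count)]

queries = sample_query_integers()
print(min(queries), max(queries))   # somewhere between 1 and 999,999,999,999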

What an intriguing graph! Over most of the range in the log-log plot, the broad trend looks very nearly linear. What does that mean? If the Google data accurately reflect the state of the web, and if my sampling of the data can be trusted, it means the number of web pages mentioning numbers of magnitude \(10^k\) is roughly constant for all k in the range from \(k = 2\) to \(k = 10\). I don’t mean to suggest that specific large numbers appear just as frequently as specific small numbers. That’s obviously untrue: A typical two- or three-digit number might be found on a billion web pages, whereas a specific nine- or ten-digit number is likely to appear on only one or two pages. But there are only 90 two-digit numbers, compared with 90 billion 10-digit numbers, so the overall number of pages in those two classes is approximately the same.

Here’s another way of saying the same thing: The product of \(N\) and \(H(N)\) is nearly constant, with a geometric mean of roughly \(7 \times 10^{10}\). An equivalent statement is that:

\[\log_{10}{N} + \log_{10}{H(N)} \approx 10.86.\]

You can visualize this fact without doing any arithmetic at all. Just print a series of \(N, H(N)\) tuples in a column and observe that the total number of digits in a tuple is seldom less than 11 or greater than 13.

    N,        H(N)
    96964835, 2120
    2048, 164000000
    476899618, 214
    96416, 374000
    75555964, 3020
    171494, 182000
    154045436, 2160
    1206, 112000000
    761088, 50200
    7500301034, 24
    13211445, 10900
    1289, 77000000
    1507549, 18100
    3488, 3330000
    7507624475, 10
    17592745, 2830
    1430187656, 30
    691, 265000000
    41670244642, 2
    326, 52900000

Although the vast majority of the 2,000 data points lie near the 10.86 “main sequence” line, there are some outliers. One notable example is 25898913. Most numbers of this magnitude garner a few thousand hits on Google, but 25898913 gets 29,500,000. What could possibly make that particular sequence of digits 10,000 times more popular than most of its neighbors? Apparently it’s not just an isolated freak. About half the integers between 25898900 and 25898999 score well below 10,000 hits, and the other half score above 20 million. I can’t discern any trait that distinguishes the two classes of numbers. Sampling from other nearby ranges suggests that such anomalies are rare.


A straight line on a log-log plot often signals a power-law distribution. The classic example is the Zipfian distribution of word frequencies in natural-language text, where the kth most common word can be expected to appear with frequency proportional to \(k^{-\alpha}\), with \(\alpha \approx 1\). Does a similar rule hold for integers on the web? Maybe. I tried fitting a power law to the data with the powerlaw Python package from Jeff Alstott et al. The calculated value of \(\alpha\) was about 1.17, which seems plausible enough, but other diagnostic indicators were not so clear. Identifying power laws in empirical data is notoriously tricky, and I don’t have much confidence in my ability to get it right, even with the help of a slick code library.
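For what it’s worth, the fitting step with that package looks roughly like this. The sketch below uses only the 20 sample values from the table above (far too few for a meaningful fit), just to show the shape of the code:

import powerlaw

# The H(N) values from the 20-row sample printed earlier; the real fit
# used all 2,000 survey points.
hits = [2120, 164000000, 214, 374000, 3020, 182000, 2160, 112000000,
        50200, 24, 10900, 77000000, 18100, 3330000, 10, 2830, 30,
        265000000, 2, 52900000]

fit = powerlaw.Fit(hits, discrete=True)   # estimates alpha and xmin
print(fit.power_law.alpha)
R, p = fit.distribution_compare('power_law', 'lognormal')
print(R, p)   # positive R favors the power law over a lognormal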

I’m actually surprised that the pattern in the graph above looks so Zipfian, because the data being plotted don’t really represent the frequencies of the numbers. Google’s hit count \(H(N)\) is an approximation to the number of web pages on which \(N\) appears, not the number of times that \(N\) appears on the web. Those two figures can be expected to differ because a page that mentions \(N\) once may well mention it more than once. For example, a page about the movie 42 has eight occurrences of 42, and a page about the movie 23 has 13 occurrences of 23. (By the way, what’s up with all these numeric movie titles?)

Another distorting factor is that Google apparently implements some sort of substring matching algorithm for digit strings. If you search for 5551212, the results will include pages that mention 8005551212 and 2125551212, and so on. I’m not sure how far they carry this practice. Does a web page that includes the number 1234 turn up in search results for all nonempty substrings: 1234, 123, 234, 12, 23, 34, 1, 2, 3, 4? That kind of multiple counting would greatly inflate the frequencies of numbers in the Googleverse.

It’s also worth noting that Google does some preprocessing of numeric data both in web pages and in search queries. Commas, hyphens, and parentheses are stripped out (but not periods/decimal points). Thus searches for 5551212, 555-1212, and 5,551,212 all seem to elicit identical results. (Enclosing the search string in quotation marks suppresses this behavior, but I didn’t realize that until late in the writing of this article, so all the results reported here are for unquoted search queries.)


In the graph above, the linear trend seems to extend all the way to the lower righthand corner, but not to the upper lefthand corner. If we take seriously the inferred equation \(N \times H(N) = 7 \times 10^{10}\), then the number of hits for \(N = 1\) should obviously be \(7 \times 10^{10}\). In fact, searches for integers in the range \(1 \le N \le 25\) generally return far fewer hits. Many of the results are clustered around \(10^{7}\), four or five orders of magnitude smaller than would be expected from the trend line.

To investigate this discrepancy, I ran another series of Google searches, recording the number of hits for each integer from 0 through 100. Note that in this graph the y axis is logarithmic but the x axis is linear.

Google hits 0 100

There’s no question that something is depressing the abundance of most numbers less than 25. The abruptness of the dip suggests that this is an artifact of an algorithm or policy imposed by the search engine, rather than a property of the underlying distribution. I have a guess about what’s going on. Small numbers may be so common that they are treated as “stop words,” like “a,” “the,” “and,” etc., and ignored in most searches. Perhaps the highest-frequency numbers are counted only when they appear in an <h1> or <h2> heading, not when they’re in ordinary text.

But much remains unexplained. Why do 2, 3, 4, and 5 escape the too-many-to-count filter? Same question for 23. What’s up with 25 and 43, which stand more than 10 times taller than their nearest neighbors? Finally, in this run of 101 Google searches, the hit counts for small integers are mostly clustered around \(10^6\), whereas the earlier series of 2,000 random searches produced a big clump at \(10^7\). In that earlier run I also noticed that searching repeatedly for the same \(N\) could yield different values of \(H(N)\), even when the queries were submitted in the space of a few seconds. For example, with \(N=1\) I saw \(H(N)\) values ranging from 10,400,000 to 1,550,000,000. Presumably, the different values are coming from different servers or different data centers in Google’s distributed database.

I was curious enough about the inconsistencies to run another batch of random searches. In the graph below the 2,000 data points from the first search are light blue and the 2,000 new points are dark blue.

Google hits 12 digit combo

Over most of the range, the two data sets are closely matched, but there’s a conspicuous change in the region between \(10^2\) and \(10^4\). In the earlier run, numbers in that size range were split into two populations, with frequencies differing by a factor of 10. I was unable to identify any property that distinguishes members of the two populations; they are not, for example, just odd and even numbers. In the new data, the lower branch of the curve has disappeared. Now there is a sharp discontinuity at \(N = 10^4\), where the typical frequency falls by a factor of 10. I have no idea what this is all about, but I strongly suspect it’s something in the Google algorithms, not in the actual distribution of numbers on the web.


The limitations of string matching—or even regular-expression matching—are more troublesome when you go beyond searching for simple positive integers. I’ve hardly begun to explore this issue, but the following table hints at one aspect of the problem.

Query         Top hit
17.3          HP Anodized Silver 17.3″ Pavilion
17.30         17.30j Syllabus – MIT
17.300        Chapter 17.300 COMPLIANCE
17.3000       Map of Latitude: 17.3000, Longitude: -62.7333
17.30000      41 25 0 0 2.000000 4.000000 6.000000 8.000000
17.300000     17.300000 [initially -35.600000] gi_24347982 (+) RNA

Search queries that are mathematically equal (when interpreted as decimal representations of real numbers) yield quite different results. And 4.999… is definitely not equal to 5.000… in the world of Google web search.

It gets even worse with fractions. A search for 7/3 brought me a calculator result correctly announcing that “7/3 = 2.33333333333” but it also gave me articles headlined “7^3 – Wolfram Alpha”, “Matthew 7:3”, “49ers take 7-3 lead”, and “Hokua 7’3″ LE – Naish”. (Enclosing the search term in quotation marks doesn’t help in this case.)


Before closing the book on this strange numerical diversion that has entertained me for the past couple of weeks, I want to comment on one more curious discovery. If you run enough searches for large numbers, you’ll eventually stumble on web sites such as numberworld, numberempire, numbersbase, each-number, every-number, all-numbers, integernumber, numbersaplenty, and numberopedia. A few of these sites appear to be created and curated by genuine number-lore enthusiasts, but others have a whiff of sleazy search-engine baiting. (For that reason I’m not linking to any of them.)

Here’s part of a screen capture from Numbers Aplenty, which is one of the more interesting sites:

[Figure: screen capture from Numbers Aplenty, showing a grid of linked numbers.]

Each of the numbers displayed on the page is a link to another Numbers Aplenty page, and the site is apparently equipped to display such a page for any positive integer less than \(10^{16}\). A few years ago, Google reported that they had indexed a trillion unique URLs on the world wide web. Evidently they hadn’t yet worked their way through the 10,000 trillion URLs at Numbers Aplenty. (But I’m pretty sure the server doesn’t have \(10^{16}\) HTML files stored on disk, patiently waiting for someone to request the information.)
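It’s easy to believe that last parenthetical: such pages are almost certainly generated on demand. Here’s a minimal sketch of the idea, using Flask and an invented factorization helper; none of it comes from the actual site:

from flask import Flask

app = Flask(__name__)

def factor(n):
    # Trial division, good enough for a demo.
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

@app.route('/<int:n>')
def number_page(n):
    # Every positive integer gets a URL, but no page is stored on disk.
    return f'<h1>{n}</h1><p>Prime factors: {factor(n)}</p>'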


And, finally, a trivia question: What is the smallest positive integer for which a Google search returns zero results? The smallest I have recorded so far is 10,041,295,923. (Of course that could change after the Googlebot indexes this page.) Can anyone find an example with 10 or fewer digits?


Update 2014-12-22. Commenter Samuel Bierwagen wrote:

The hits number on the first page is very inaccurate, frequently off by several orders of magnitude. To get better results you have to go to the second or third page.

I’ve now given this idea a try. Whenever the estimated hit count is at least 1 million, I repeat the search, appending “start=20” to the query string. This has the effect of requesting the third page of results (i.e., results 20 through 29). Here’s the outcome:

[Figure: hit counts from the earlier surveys (light blue) and from the new page-three survey (dark blue).]

Light blue dots are from earlier surveys (4,000 points in all). Dark blue dots are from the new survey, with the page-three request installed. There’s a dramatic change in the hit counts \(H(N)\) for \(N \lt 100\). For these small \(N\), Google’s first-page hit count fluctuates wildly and is often near \(10^{7}\). The third-page results are higher and much more consistent. Indeed, all \(N \lt 14\) returned exactly the same hit count: 25,270,000,000. This uniformity suggests that we’re still seeing some sort of filtering in the results—I suspect Google may be trying to keep secret the overall size of their index—but at least the trend line is now monotonic.
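For anyone who wants to try this at home, a page-three request looks roughly like the sketch below. This is not my survey script; the start parameter in the query URL and the “About N results” text being scraped are both assumptions about Google’s interface that could change at any time.

import re
import urllib.parse
import urllib.request

def hit_count(query, page=0):
    # Request the given results page (page=2 means start=20, i.e., the third page)
    # and scrape the estimated hit count from the "About N results" line.
    params = urllib.parse.urlencode({'q': query, 'start': 10 * page})
    url = 'https://www.google.com/search?' + params
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib.request.urlopen(req).read().decode('utf-8', errors='ignore')
    m = re.search(r'About ([\d,]+) results', html)
    return int(m.group(1).replace(',', '')) if m else None

n = 123456789012                            # an arbitrary 12-digit example
first = hit_count(str(n))
if first is not None and first >= 10**6:
    third = hit_count(str(n), page=2)       # the page-three estimate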

In another comment, Brian J. Peterson mentions seeing HTTP responses with error code 503 (service temporarily unavailable). I had not encountered any such errors in my earlier search series, but I did see some in this latest run (86 errors out of 2,000 searches). My best guess is that a request for page 3 may occasionally take more than 1 second, so that the transaction hasn’t completed when the next search is initiated.

Meanwhile, I have learned of another study of number prevalence on the web with a much better data source than Google hit counts. In a short paper from the 2014 World Wide Web companion conference, Willem Robert van Hage and two colleagues used the Common Crawl web archive to measure number frequencies. They looked at real numbers, not just integers, and I’m not sure how to compare their results with mine. My main response to seeing this work is that the Common Crawl is an amazing resource–each of us can build a Google of our own–and I want to spend some time next year playing with it.

Posted in computing, mathematics | 20 Comments

Four Fifths = A Fifth

Browsing in the library stacks the other day, I came upon this passage in a 1970 paper discussing the uses of computers in mathematics:

Lander and Parkin's counter-example [26] to Euler's conjecture in the theory of diophantine equations that no nth power equals the sum of less than n nth powers---namely, that 144^5 = 27^5 + 85^5 + 110^5 + 133^5 is highly satisfying, but does not contribute to the conceptual framework of mathematical activity.

Notice anything fishy about that equation? On the right side of the equal sign we have an odd number of odd numbers, and so their sum must be odd; but the power of 144 on the left side is even.

I’m not clever enough to have caught that parity violation on first glance; I noticed it only later. On the other hand, my experience as an editor has taught me never to trust an author’s arithmetic. I had a computer sitting in front of me, open to an IPython notebook, so I typed in the numbers:

[Figure: IPython notebook cell checking the arithmetic of the quoted equation.]
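In text form the check amounts to nothing more than this (reconstructed here; the notebook cell may have been formatted a little differently):

# Does the quoted equation balance?
lhs = 144**5
rhs = 27**5 + 85**5 + 110**5 + 133**5
print(lhs, rhs, rhs - lhs)   # 61917364224 62172297925 254933701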

Had I just discovered that a 50-year-old counterexample to a 250-year-old conjecture is not in fact a counterexample after all? Surely not. Someone must have erred in copying the equation from the original source.

So I decided to track down that source. The reference in the quoted passage directed me to a 1966 paper by L. J. Lander and T. R. Parkin, which I quickly found online. However, the reference is mistaken; the cited paper says nothing at all about fifth powers and has no equation resembling the one above.


A faulty equation and a wayward reference within a single paragraph: That’s not so good. But I’ve done worse myself, and my purpose here is not to criticize an errant author (whom I’m not going to name). What interests me is how one might go about correcting the error in the equation and finding the true counterexample to the Euler conjecture. Here are some strategies I might have tried that morning:

  • I was sitting in a university library. I had all the apparatus of scholarly research at my fingertips. I could turn to MathSciNet, say, and identify the correct paper by Lander and Parkin.
  • I was sitting at a computer with a wifi connection. I had all the resources of the Internet at my fingertips. I could Google it, or consult some more specialized database such as the OEIS.
  • I could just guess the answer. On the assumption that exactly one of the five terms has been altered, I would expect to come up with the correct equation after just a few tries.
  • I could start from scratch. I could write a program to repeat the computer search that Lander and Parkin evidently used to discover the equation in the first place.
  • I could bring the might of mathematics to bear on the problem. That is, I could try to “solve the equation.”

At this point you might want to stop reading, choose your own favorite method, and give it a try. The answer, once you have it, is not all that illuminating, but I found the process of searching had its moments of interest.

If guessing is one of your strategies, you had better try it first. I already had the equation entered into the IPython notebook, where I could easily alter a value and observe the result. (A spreadsheet would serve as well in this role.) I started with the \(27^5\) term, reducing it to \(26^5\), then \(25^5\), and so on, but the effect was too small. Even setting that term to zero or to \((-27)^5\) left an excess, so I moved on to the \(85^5\) term. Proceeding in this way, it didn’t take long to pinpoint the error.
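The procedure is also easy to automate. Here’s a sketch that assumes exactly one of the four terms is wrong and tries to repair each in turn; it’s a reconstruction of the strategy, not the cells I actually typed:

def find_single_term_fix(terms=(27, 85, 110, 133), total=144, power=5):
    # Assume exactly one term is wrong. For each term, compute what its fifth
    # power would have to be for the equation to balance, and check whether
    # that value is itself a perfect fifth power.
    target = total ** power
    for i in range(len(terms)):
        rest = sum(x ** power for j, x in enumerate(terms) if j != i)
        residue = target - rest
        if residue <= 0:
            continue
        root = round(residue ** (1 / power))
        for candidate in (root - 1, root, root + 1):   # guard against float error
            if candidate > 0 and candidate ** power == residue:
                fixed = list(terms)
                fixed[i] = candidate
                return tuple(fixed)
    return None

print(find_single_term_fix())   # (27, 84, 110, 133)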

(For the record, the correct reference is L. J. Lander and T. R. Parkin, 1966, Counterexample to Euler’s conjecture on sums of like powers, Bulletin of the American Mathematical Society 72:1079. The text of this paper consists of five lines, one of which is the equation.)

Googling also produced useful results, including pages on Wikipedia and MathWorld that give the correct equation and the correct reference. But finding those pages was not straightforward, and the experience has left me wondering how best to search for numerical or mathematical content on the web. You can always go after the metadata; in this case the names Lander, Parkin, and Euler are good choices, as well as the terms “nth power” and “conjecture.” You might also get lucky searching for character strings such as “144^5” and “27^5”, but here the outcome is hit or miss. There are many ways of encoding mathematics in digital documents—TeX, MathML, the ASCII pidgin that evolved years ago on Usenet, notations based on programming languages. Without some higher-level semantic processing, no single search string will find them all. Some web sites (including MathWorld) turn equations into tiny GIF images, making search even more difficult.

A question I haven’t been able to answer through web search is whether the erroneous equation from that 1970 paper has propagated into the subsequent literature. I have found no evidence of it, but I have little confidence in the ability of current search technology to answer the question reliably. Gathering quantitative data—how many times does the equation appear on the web?—also seems to be out of reach.

The issue goes beyond Google and the other web search engines. MathSciNet claims to accept search strings in TeX format, but I’ve never figured out how to make it work. As for the OEIS, the search function there is specialized for finding integer sequences—as one might expect in an Online Encyclopedia of Integer Sequences. Searching for an equation is an “off label” use of the archive. Still, I was sure I’d find it. I tried searching for 61917364224 (which is \(144^5\)), and got 47 hits. (It’s a popular number because it’s a product of many small factors, \(2^{20} \cdot 3^{10}\), and is also the fifth power of the Fibonacci number 144.) However, none of those 47 sequences has anything to do with the Euler sum-of-powers conjecture or the Lander-Parkin counterexample. Yet the equation I was looking for does indeed appear in the OEIS, namely in sequence A134341.

The next approach—redoing the entire computation from scratch—turned out to be easier than I expected. Lander and Parkin ran their search on a CDC 6600, which was the most powerful computer of its time, the world’s first megaFLOPS machine, designed by Seymour Cray early in his career. I don’t know how long the computation took on the 6600, but there’s not much to it on a modern laptop. Here’s some Python code:

import itertools as it

def four_fifths(n):
    """Return smallest positive integers ((a,b,c,d),e) such that
       a^5 + b^5 + c^5 + d^5 = e^5; if no such tuple exists
       with e < n, return the string 'Failed'."""
    fifths = [x**5 for x in range(n)]      # fifths[x] == x**5
    combos = it.combinations_with_replacement(range(1, n), 4)
    for cc in combos:
        cc_sum = sum(fifths[i] for i in cc)
        if cc_sum in fifths:               # is the sum itself a fifth power?
            return cc, fifths.index(cc_sum)    # the index recovers e from e**5
    return 'Failed'

The first step here is to precalculate the fifth powers of all integers less than a given limit, n. Then we generate all combinations of four integers in the range from 1 to n-1, starting with the 4-tuple (1, 1, 1, 1) and continuing to (n-1, n-1, n-1, n-1). For each combination we look up the fifth powers of the integers and check to see whether their sum is also present in the precomputed list of fifth powers. Invoking this program as four_fifths(150) returns the correct answer in about 30 seconds. A little sad, isn’t it? This was once a research project fit for a supercomputer, and now it’s reduced to the level of a homework assignment.

The final tool we might apply to this problem is the hammer of mathematics. I would put the question as follows. We are looking for integer solutions to this equation:

\[a^5 + b^5 + c^5 + d^5 - e^5 = 0.\]

In that quest, do we gain any significant leverage by knowing that:

\[27^5 + 85^5 + 110^5 + 133^5 - 144^5 = 254933701 ?\]

Can we take the number 254933701 and apply some algebraic or analytic hocus pocus that will lead us directly to the correct values of a, b, c, d, e? If we make no further assumptions or guesses, I believe the answer must be no. After all, there are innumerable values for a, b, c, d, e that we can plug into our equation to get something other than 0. Consider this example:

\[30^5 + 69^5 + 82^5 + 86^5 - 100^5 = -43.\]

That’s a very near miss—it comes within 5 parts per billion of being a true solution—and yet it tells us nothing about the set of correct solutions except that this isn’t one of them.

An erroneous equation seems to be useful only if we know quite a lot about the nature of the errors. For example, in the faulty equation from 1970, if someone tips you off that all the terms are correct except for \(85^5\), you can form the equation

\[85^5 - x^5 = 254933701\]

and solve for \(x\). If \(x\) turns out to be an integer, you have your answer. And indeed this is the case:

\[x = \sqrt[5]{4182119424} = 84.\]

But is this procedure better than guesswork? Is it different from guesswork?

Just for the record, the correct Lander-Parkin equation is:

\[27^5 + 84^5 + 110^5 + 133^5 = 144^5 = 61917364224.\]


After discovering their counterexample to the Euler conjecture, Lander and Parkin, together with John Selfridge, made some conjectures of their own. Suppose an nth power can be partitioned into k smaller nth powers. They conjectured that k is never less than \(n-1\). And they asked whether there always exists an example where \(k = n-1\). You might try to settle the latter question for the case of \(n = 6\). Can you find a set of five 6th powers that add up to another 6th power? For this problem, Google won’t help you, because the answer is unknown. No one has even identified a set of six 6th powers that sum to a 6th power. If you approach this task as a computational challenge, it’s rather more than a homework assignment.

Posted in computing, mathematics | 8 Comments

Lopsided

Sifting through the election results last week, I noticed that the precinct where I used to live in Durham, North Carolina, voted 620 to 40 in favor of the Blue candidate in the U.S. Senate race. That’s a margin of just under 94 percent. A few nearby polling places were even more lopsided: One score was 288 to 7, which works out to 97.6 percent. Statewide, however, the contest was quite close; the Blues lost with 49.13 percent to the Reds’ 50.87 percent. (The percentages I’m giving here are based on votes for the two major parties only; published results are slightly different because they include votes for third-party candidates and write-ins.)

The combination of one-sided local results and a nearly even split in the statewide totals left me curious about the distribution of vote margins, or what we might call the political polarization spectrum. Are extreme ratios, like those in my old neighborhood, rare outliers? Or have we become a nation of segregated political communities, where you live with your own kind and stay away from places where the other side dominates?

The infographics group at The New York Times has produced a series of election maps that offer much geographic insight.

[Figure: portion of a New York Times map showing voting patterns (and population density) in the 2014 U.S. Senate election in North Carolina, for a section of the state extending from Greensboro on the left to Durham on the right; Durham is the blue inkblot near the right edge. Graphics by Amanda Cox, Mike Bostock, Derek Watkins, and Shan Carter.]

The North Carolina map shows deep blue city cores surrounded by ruddy or purplish suburbs. Rural areas, with lower population density, range from pink to baby blue. The maps are lovely—they transform ugly politics into luscious cartography—but they can’t give a quantitative answer to my question about political segregation and integration.

So off I went to the web site of the North Carolina Board of Elections, where I pulled down the latest report (still unofficial), giving vote totals for the state’s 2,726 precincts. After some vigorous data wrangling (Excel, Textmate, Python, Lisp, eyeballing), I had a histogram with 20 bins, classifying precincts according to their percentage of votes for the Red candidate.
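Stripped of all the wrangling, the binning step is simple. Here’s a sketch; the file name and column names are hypothetical stand-ins for the cleaned-up Board of Elections report:

import csv
from collections import Counter

bins = Counter()
with open('nc_precincts.csv') as f:            # hypothetical cleaned-up file
    for row in csv.DictReader(f):
        red = int(row['red_votes'])            # hypothetical column names
        blue = int(row['blue_votes'])
        if red + blue == 0:
            continue
        share = red / (red + blue)             # Red share of the two-party vote
        bins[min(int(share * 20), 19)] += 1    # 20 bins, 5 percentage points each

for b in range(20):
    print(f'{5 * b:3d}-{5 * (b + 1):3d}% Red: {bins[b]} precincts')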

Before I show the data, it’s worth pausing to think about what the distribution might be expected to look like. Suppose the state’s population were thoroughly mixed, with Red and Blue voters scattered at random. Then the precinct voting margins would follow a normal distribution, with most of the precincts near the 50-50 mark and only a few out in the far-right and far-left tails. The curve might look something like this:

[Figure: a toy histogram of precinct vote shares under random mixing, drawn as a broad normal curve.]

Actually, that’s not quite what the curve would look like; it would be much narrower. Suppose you take 3,000 precincts with 1,000 voters each, and you paint the individual voters Red or Blue at random with probability 1/2. You would get a normal distribution with a mean of 500 and a standard deviation of about 16 voters (i.e., \(\tfrac{1}{2}\sqrt{1000} \approx 15.8\)). Almost all of the precincts would lie within three standard deviations of the mean, or in other words between 450 and 550 Red voters. The histogram would be more like this:

[Figure: a much narrower normal distribution, the result expected from random mixing.]
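That standard-deviation estimate is easy to check by simulation; here’s a quick sketch, with nothing calibrated to any real electorate:

import random
import statistics

# 3,000 precincts of 1,000 voters, each voter Red or Blue with probability 1/2.
red_counts = [sum(random.random() < 0.5 for _ in range(1000))
              for _ in range(3000)]

print(statistics.mean(red_counts))       # close to 500
print(statistics.stdev(red_counts))      # close to sqrt(1000)/2, about 15.8
print(min(red_counts), max(red_counts))  # rarely strays outside 450-550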

Of course nobody who reads the newspaper would suggest that random mixing provides a good model of the American political landscape. Given the evidence of increasing polarization in the public, we might expect to find a two-humped camel, with a Red peak and a Blue peak, and not much of anybody in the middle:

[Figure: a bimodal histogram with a Red peak and a Blue peak.]

Or maybe the distribution is even more extreme: an empty bowl, with every precinct drifting centrifugally toward the outer fringes of the political spectrum:

[Figure: a U-shaped histogram with precincts piled up at both extremes.]

All right, I’ve stalled long enough; no more playing with toy distributions. Here’s the real deal—the histogram based on North Carolina precinct data for the recent Senate election:

[Figure: histogram of North Carolina precinct results in the 2014 Senate race, binned by percentage of the two-party vote going to the Red candidate.]

The shape of this curve took me by surprise, mostly because of its strong asymmetry. Remember, the total numbers of Red and Blue voters in the state are roughly equal, but the apportionment of Reds and Blues in precincts turns out to be quite different. At the left edge of the graph there are 150 precincts where at least 90 percent of the voters chose the Blue candidate, but over on the right side there are only 8 precincts that voted at least 90 percent Red.

Please don’t misinterpret this graph. It doesn’t say that Blue voters are more extreme or more radicalized than Red ones. Nothing of the sort. It says that Blue voters tend to huddle together in more homogeneous communities. If you’re Blue and you want to be surrounded by like-minded voters, there are hundreds of places you can live. If you’re Red, you have lots of choices where you’ll be in the majority—more than 60 percent of the precincts satisfy that criterion—but you would have a hard time finding areas where you have few or no Blue neighbors.

The pattern seemed peculiar enough that I began to wonder if it might be unique to North Carolina. So I took a look at the Senate race in Virginia, which was even closer than the North Carolina contest. The Virginia pattern differs only in detail. Again we have an abundance of solidly Blue precincts, but very few solidly Red ones:

[Figure: histogram of Virginia precinct results in the 2014 Senate race.]

Could it be something about the South that produces this skew in the curve? I wanted to check the closely contested governor’s race in Massachusetts (where I live now), but the state hasn’t yet posted precinct-level results. Instead I looked at the Minnesota Senatorial election:

[Figure: histogram of Minnesota precinct results in the 2014 Senate race.]

Here the raised shoulder on the Blue side is not nearly as dramatic, but the asymmetry is present. (Note that this race did not have a close finish; the Blues won by 10 percentage points.)


There are several ways we might try to explain these patterns—or explain them away. A glance at the Times maps, with their inky blue urban neighborhoods, suggests it’s all about city living. But this impression is partly an artifact of the maps’ graphic scheme, which uses color to encode not just voting margin but also population density. As a result, strongly skewed votes in rural areas are not nearly as conspicuous as those in cities. In North Carolina there are several pale blue precincts in the countryside that have vote totals above 85 percent Blue. Minnesota exhibits the same pattern; indeed, one Iron Range precinct, in the far north of the state, recorded a clean sweep for the Blues, 218 to 0. In Virginia, on the other hand, all of the most strongly biased precincts seem to be urban. And it remains true that the vast majority of voters in lopsided precincts are city dwellers.

It’s also important to note that city districts are geographically much smaller than rural or suburban ones. Perhaps this fact alone could explain much of the Blue-Red difference. If we carved up the rural districts into areas the same size as the city precincts, would we find homogeneous clusters of voters? I have no data to answer that question one way or the other. But even if such clusters do exist, the situation remains asymmetric because the Blue voting blocs are larger in population by an order of magnitude.

Race is another factor that can’t be ignored. Those strongly Blue precincts in rural North Carolina are also largely black precincts. But race is not the whole story. Many of the Bluest urban districts are racially and ethnically diverse.


In the end, what’s most puzzling about the histograms is not just the existence of many pure Blue precincts but the near absence of pure Red ones. What accounts for this imbalance? It’s not hard to imagine social mechanisms that would separate people into affinity groups, but the simplest models are symmetrical. We need to find a factor that acts differently on the two parties.

I have not settled in my own mind what I think is going on here, but I would like to offer three possible mechanisms for consideration.

First, maybe Red voters and Blue voters differ in the criteria they apply when choosing a place to live. We might test this idea in a variant of the Schelling segregation model, in which people tend to go elsewhere when they have too few neighbors of their own kind. Perhaps Red voters refuse to live where the proportion of Reds is less than 1/3, but Blues are content to stay even where they are a tiny minority. Alternatively, we might suppose that when Blues reach a 2/3 majority, they drive out the remaining Reds, whereas Reds are willing to tolerate a Blue minority in their midst. In either case, the result is “Red flight” from strongly Blue areas but no countercurrent of Blue voters fleeing Red areas. I think this model might be capable of explaining the observations, but I have no idea how to explain the model. Is there any evidence for such an asymmetry in personal preferences and behavior?
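For anyone who’d like to experiment with that first mechanism, here is a bare-bones sketch of such an asymmetric Schelling variant. The grid size, the thresholds, and the crude move-by-swapping rule are all arbitrary choices, not calibrated to any data:

import random

SIZE = 50                 # 50 x 50 toroidal grid
RED_THRESHOLD = 1 / 3     # Reds move when fewer than 1/3 of neighbors are Red;
                          # Blues never move, no matter how few Blues are nearby.

def same_color_fraction(grid, r, c):
    color = grid[r][c]
    nbrs = [grid[(r + dr) % SIZE][(c + dc) % SIZE]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1) if dr or dc]
    return sum(1 for x in nbrs if x == color) / len(nbrs)

def step(grid):
    movers = [(r, c) for r in range(SIZE) for c in range(SIZE)
              if grid[r][c] == 'R'
              and same_color_fraction(grid, r, c) < RED_THRESHOLD]
    random.shuffle(movers)
    for r, c in movers:
        r2, c2 = random.randrange(SIZE), random.randrange(SIZE)
        grid[r][c], grid[r2][c2] = grid[r2][c2], grid[r][c]   # crude relocation

grid = [[random.choice('RB') for _ in range(SIZE)] for _ in range(SIZE)]
for _ in range(100):
    step(grid)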

The second possibility is that Blues are more effective than Reds in persuading their neighbors to support the local political majority. In other words, it’s not that Reds flee or are expelled from Blue areas; rather, they are converted into Blues. Then we have to ask: How come the Reds are unable to win over their own Blue neighbors?

My third candidate explanation is gerrymandering. Maybe what we’re seeing is not some natural tendency to form uniformly Blue communities but rather an attempt to draw the boundaries of precincts in a way that concentrates Blue voters in certain districts, leaving the rest of the precincts with a Red majority. To test this hypothesis we might compare states in which different parties dominate the state legislature and thus control the redistricting process. If the asymmetry really is caused by an attempt to hem in the minority party, we should see a mirror image in states where the other party is in power. The three states I’ve looked at so far are not a useful sample in this respect: North Carolina and Virginia both have Red legislatures, and Minnesota’s was also controlled by the Reds during the most recent redistricting cycle. Massachusetts will be a good test case when the numbers come out.


One last thought: In this essay I have written about Reds and Blues rather than Republicans and Democrats in an attempt to keep the focus on a mathematical question and to keep my distance from partisan passions. For the record, however, I don’t actually believe that politics is a game played by brightly colored teams. And I do take sides. Last Tuesday, in my opinion, the stinkers won.

Posted in social science, statistics | 19 Comments