## The long run

The other day I went over to Danehy Park in Cambridge, which has a fine 400-meter running track. I did four laps at my usual plodding pace. Here is my path, as recorded by an iPhone app called Runmeter:

No, I wasn’t drunk. The blue trace shows me lurching all over the track, straying onto the soccer field, and taking scandalous shortcuts in the turns—but none of that happened, I promise. During the entire run my feet never left the innermost lane of the oval. All of my apparent detours and diversions result from GPS measurement errors or from approximations made in reconstructing the path from a finite set of measured positions.

At the end of the run, the app tells me how far I’ve gone, and how fast. Can I trust those numbers? Looking at the map, the prospects for getting accurate summary statistics seem pretty dim, but you never know. Maybe, somehow, the errors balance out.

Consider the one-dimensional case, with a runner moving steadily to the right along the $$x$$ axis. A GPS system records a series of measured positions $$x_0, x_1, \ldots, x_n$$ with each $$x_i$$ displaced from its true value by a random amount no greater than $$\pm\epsilon$$. When we calculate total distance from the successive positions, most of the error terms cancel. If $$x_i$$ is shifted to the right, it is farther from $$x_{i-1}$$ but closer to $$x_{i+1}$$. For the run as a whole, the worst-case error is just $$\pm 2 \epsilon$$—the same as if we had recorded only the endpoints of the trajectory. As the length of the run increases, the percentage error goes to zero.

In two dimensions the situation is more complicated, but one might still hope for a compensating mechanism whereby some errors would lengthen the path and others shorten it, and everything would come out nearly even in the end. Until a few days ago I might have clung to that hope. Then I read a paper by Peter Ranacher of the University of Salzburg and four colleagues. (Take your choice of the journal version, which is open access, or the arXiv preprint. Hat tip to Douglas McCormick in IEEE Spectrum, where I learned about the story.)

Ranacher’s conclusion is slightly dispiriting for the runner. On a two-dimensional surface, GPS position errors introduce a systematic bias, tending to exaggerate the length of a trajectory. Thus I probably don’t run as far or as fast as I had thought. But to make up for that disappointment, I have learned something new and unexpected about the nature of measurement in the presence of uncertainty, and along the way I’ve had a bit of mathematical adventure.

The Runmeter app works by querying the phone’s GPS receiver every few seconds and recording the reported longitude and latitude. Then it constructs a path by drawing straight line segments connecting successive points.

Two kinds of error can creep into the GPS trajectory. Measurement errors arise when the reported position differs from the true position. Interpolation errors come from the connect-the-dots procedure, which can miss wiggles in the path between sampling points. Ranacher et al. consider only the inaccuracies of measurement, on the grounds that interpolation errors can be reduced by more frequent sampling or a more sophisticated curve-fitting method (e.g., cubic splines rather than line segments). Interpolation error is eliminated altogether if the runner’s path is a straight line.

Suppose a runner on the $$x, y$$ plane shuttles back and forth repeatedly between the points $$p = (0, 0)$$ and $$q = (d, 0)$$. In other words, the end points of the path lie $$d$$ units apart along the $$x$$ axis. After $$n$$ trips, the true distance covered is clearly $$nd$$. A GPS device records the runner’s position at the start and end of each segment, but introduces errors in both the $$x$$ and $$y$$ coordinates. Call the perturbed positions $$\hat{p}$$ and $$\hat{q}$$, and the Euclidean distance between them $$\hat{d}$$. Ranacher and his colleagues show that for large $$n$$ the total GPS distance $$n \hat{d}$$ is strictly greater than $$nd$$ unless all the measurement errors are perfectly correlated. What happens if the measurements are perfectly correlated? If all the $$\hat{p}$$s and $$\hat{q}$$s are displaced from the $$p$$s and $$q$$s in exactly the same way, the errors leave the total distance unchanged, so that $$n \hat{d} = nd$$. They give a proof of this proposition, then go on to discuss the size of the effect, and finally report on two experiments with real GPS data.

I wanted to see for myself how measured distance grows as a function of GPS error, so I wrote a simple Monte Carlo program. The Ranacher proof makes no assumptions about the statistical distribution of the errors, but in a computer simulation it’s necessary to be more concrete. I chose a model where the GPS positions are drawn uniformly at random from square boxes of edge length $$2 \epsilon$$ centered on the points $$p$$ and $$q$$.

In the sketch above, the black dots, separated by distance $$d$$, represent the true endpoints of the runner’s path. The red dots are two GPS coordinates $$\hat{p}$$ and $$\hat{q}$$, and the red line gives the measured distance between them. We want to know the expected length of the red line averaged over all possible $$\hat{p}$$ and $$\hat{q}$$.

Getting the answer is quite easy if you’ll accept a numerical approximation based on a finite random sample. Write a few lines of code, pick some reasonable values for $$d$$ and $$\epsilon$$, crank up the random number generator, and run off 10 million iterations. Some results:

$$\epsilon$$ $$d$$ $$\hat{d}$$
0.0 1.0 1.0000
0.1 1.0 1.0034
0.2 1.0 1.0135
0.3 1.0 1.0306
0.4 1.0 1.0554
0.5 1.0 1.0882

For each value of $$\epsilon$$ the program generated $$10^7$$ $$(\hat{p}, \hat{q})$$ pairs, calculated the Euclidean distance $$\hat{d}$$ between them, and finally took the average $$\langle \hat{d} \rangle$$ of all the distances. It’s clear that $$\langle \hat{d} \rangle > d$$ when $$\epsilon > 0$$. Not so clear is where these particular numbers come from. Can we understand how $$\hat{d}$$ is determined by $$d$$ and $$\epsilon$$?

For a little while, I thought I had a simple explanation. I reasoned as follows: We already know from the one-dimensional case that the $$x$$ component of the measured distance has an expected value of $$d$$. The $$y$$ component, orthogonal to the direction of motion, is the difference between two randomly chosen points on a line of length $$2 \epsilon$$; a symmetry argument gives this length an expected value of $$2 \epsilon / 3$$. Hence the expected value of the measured distance is:

$\hat{d} = \sqrt{d^2 + \left(\frac{2 \epsilon}{3}\right)^2}\, .$

Ta-dah!

Then I tried plugging some numbers into that formula. With $$d = 1$$ and $$\epsilon = 0.3$$ I got a distance of 1.0198. The discrepancy between this value and the numerical result 1.0306 is much too large to dismiss.

What was my blunder? Repeat after me: The average of the squares is not the same as the square of the average. I was calculating the squared distance as $${ \langle x \rangle}^2 + {\langle y \rangle}^2$$ when what I should have been doing is $$\langle {x^2 + y^2}\rangle$$. We need to average over all possible distances between a point in one square and a point in the other, not over all $$x$$ and $$y$$ components of those distances. Trouble is, I don’t know how to calculate the correct distance.

I thought I’d try to find an easier problem. Suppose the runner stops to tie a shoelace, so that the true distance $$d$$ drops to zero; thus any movement detected is a result of GPS errors. As long as the runner remains stopped, the two error boxes exactly overlap, and so the problem reduces to finding the average distance between two randomly selected points in the unit square. Surely that’s not too hard! The answer ought to be some simple and tidy expression—don’t you think?

In fact the problem is not at all easy, and the answer is anything but tidy. We need to evaluate a terrifying quadruple integral:

$\iiiint_0^1 \sqrt{(x_q - x_p)^2 + (y_q - y_p)^2} \, dx_p \, dx_q \, dy_p \, dy_q\, .$

Lucky for me, I live in the age of MathOverflow and StackExchange, where powerful wizards have already done my homework for me. Another resource, if you have access: Steven R. Dunbar, “The Average Distance between Points in Geometric Figures,” The College Mathematics Journal, Vol. 28, No. 3 (May, 1997), pp. 187–197. The integral evaluates to:

$\frac{2+\sqrt{2}+5\log(1+\sqrt{2})}{15} \approx 0.52140543316$

Nothing to it, eh?

The corresponding expression for nonzero $$d$$ is doubtless even more of a monstrosity, but I’ve made no attempt to derive it. I am left with nothing but the Monte Carlo results. (For what it’s worth, the simulations do agree on the value $$\hat{d} = 0.5214$$ for $$d = 0$$).

I tried applying the Monte Carlo program to my 1,600-meter run. In the Runmeter data my position is sampled 100 times, or every six seconds on average, which means that $$d$$ (the true average distance between samples) should be about 16 meters. Estimating $$\epsilon$$ is not as easy. In the map above there’s one point on the blue path that’s displaced by at least 10 meters, but if we ignore that outlier most of the other points are probably within about 3 meters of the correct lane. Plugging in $$d = 16$$ and $$\epsilon = 3$$ yields about 1,620 meters as the expected measured distance.

What does the Runmeter app have to say? It reports a total distance of 1,599.5 meters, which is, I’m inclined to say, way too good to be true. Part of the explanation is that the measurement errors are not uniform random variables; there are strong correlations in both space and time. Also, measurement errors and interpolation errors surely have canceled out to some extent. (It’s even possible that the developers of the app have chosen the sampling interval to optimize the balance between the two error types.) Still, I have to say that I am quite surprised by this uncanny accuracy. I’ll have to run some more laps to see if the performance is repeatable.

Another thought: People have been measuring distances for millennia. How is it that no one noticed the asymmetric impact of measurement errors before the GPS era? Wouldn’t land surveyors have figured it out? Or navigators? Distinguished mathe­maticians, including Gauss and Legendre, took an interest in the statistical analysis of errors in surveying and geodesy. They even did field work. Apparently, though, they never stumbled on the curious fact that position errors orthogonal to the direction of measurement lead to a systematic bias toward greater lengths.

There’s yet another realm in which such biases may have important consequences: measurement in high-dimensional spaces. Inaccuracies that cause a statistical bias of 2 percent in two-dimensional space give rise to a 19 percent overestimate in 10-dimensional space. The reason is that errors along all the axes orthogonal to the direction of measurement contribute to the Euclidean distance. By the time you get to 1,000 spatial dimensions, the measured distance is more than six times the true distance.

Even for creatures like us who live their lives stuck in three-space, this observation might be more than just a mathematical curiosity. Lots of algorithms in machine learning, for example, measure distances between vectors in high-dimensional spaces. Some of those vectors may be closer than they appear.

Posted in computing, mathematics | 13 Comments

## Lost in Translation

J. E. Littlewood told a cute story about a paper he supposedly published in the Comptes Rendus de l’Académie des sciences de Paris. The paper had three footnotes, which read, in French:

1. I am greatly indebted to Prof. Riesz for translating the present paper.
2. I am indebted to Prof. Riesz for translating the preceding footnote.
3. I am indebted to Prof. Riesz for translating the preceding footnote.

Why stop at three? Littlewood explains: “However little French I know I am capable of copying a French sentence.”

I thought of this incident the other day when I received a letter from Medicare. At the top of the single sheet of paper was the heading “A Message About Medicare Premiums,” followed by a few paragraphs of text, and at the bottom this boldface note:

The information is printed in Spanish on the back

Naturally, I turned the page over. I found the heading “Un mensaje sobre las primas de Medicare,” followed by a few paragraphs of Spanish text, and then this in boldface:

La información en español está impresa al dorso

The line is a faithful translation of the English text from the other side of the sheet. (O el inglés es una traducción fiel del español.) But in this case neither copying nor faithful translation quite suffices. It seems we have fallen into the wrong symmetry group. The statement “This sentence is not in Spanish” is true, but its translation into Spanish, “Esta frase no está en español” is false. Apart from that self-referential tangle, if the two boldface notes in the letter are to be of any use to strictly monolingual readers, shouldn’t they be on opposite sides of the paper?

By the way, I had always thought the Littlewood three-footnote story referred to a real paper. But his account in A Mathematicians’s Miscellany suggests it was a prank he never had a chance to carry out. And in browsing the Comptes Rendus on Gallica, I find no evidence that Littlewood ever published there. [Please see comment below by Gerry Myerson.]

Posted in linguistics, problems and puzzles | 3 Comments

## The writing on the wall

South wall. Hover to magnify.

A place where thousands of people suffered and died makes an uncomfortable tourist destination, yet looking away from the horror seems even worse than staring. And so, when Ros and I were driving from Prague to Dresden last month, we took a slight detour to visit Terezín, the Czech site that was the Theresienstadt concentration camp from late 1941 to mid 1945. We expected to be disturbed, but we stumbled onto something that was disturbing in an unexpected way.

Terezín was not built as a Nazi concentration camp. It began as a fortress, erected in the 1790s to defend the Austrian empire from Prussian threats. Earthen ramparts and bastions surround buildings that were originally the barracks and stables for a garrison of a few thousand troops. By the 20th century the fortress no longer served any military purpose. The troops withdrew, civilians moved in, and the place became a town with a population of about 7,000.

In 1941 the Gestapo and the SS siezed Terezín, expelled the Czech residents, and began the “resettlement” of Jews deported from Prague and elsewhere. In the next three years 150,000 prisoners passed through the camp. All but 18,000 perished before the end of the war.

Now Terezín is again a Czech town, as well as a museum and memorial to the holocaust victims. It seems a lonely place. A few boys kick a ball around on the old parade ground, the café has two or three customers, someone is holding a rummage sale—but the town’s population and economy have not recovered. The museum occupies parts of a dozen buildings, but many of the others appear to be vacant.

We looked at the museum exhibits, then wandered off the route of the self-guided tour. At the edge of town, near a construction site, a tunnel passed under the fortifications. Walking through, we came out into a grassy strip of land between the inner and outer ramparts. When we turned back to the tunnel, we noticed graffiti on the walls of the portal.

At first I assumed it was recent adolescent scribbling, but on looking closer we began to see dates in the 1940s, carved into the sandstone blocks. Could it be true? Could these incised names and drawings really be messages from the concentration-camp era? If so, who left them for us? Did the prisoners have access to this tunnel, or was it an SS guard post?

North wall.

I was skeptical. Too good to be true, I thought. If the carvings were genuine, they would not have been left out here, exposed to the elements and unprotected against vandalism. They would be behind glass in one of the museum galleries. But if they were not genuine, what were they?

I took pictures. (The originals are on Flickr.)

Back home, some days later, my questions were answered. Googling for a few phrases I could read in the inscriptions turned up the website ghettospuren.de, which offers extensive documentation and interpretation (in Česky, Deutsch, and English). Briefly, the carvings are indeed authentic, as shown by photographs made in 1945 soon after the camp was liberated. The markings were made by members of the Ghettowache, the internal police force selected from the prison population. A dozen of the artists have been identified by name.

The website is the project of Uta Fischer, a city planner in Berlin, with the photographer Roland Wildberg and other German and Czech collaborators. They are working to preserve the carvings and several other artifacts discovered in Terezín in the past few years.

I offer a few notes and speculations on some of the inscriptions, drawing heavily on Fischer’s commentary and translations:

 “Brána střežena stráží ghetta L.P. 1944.” Translation from ghettospuren.de: “The gate is being guarded by the ghetto guard, A.D. 1944.” This sign, given a prominent position at the entrance to the tunnel, reads like a territorial declaration. The date is interesting. Are we to infer that the gate was not guarded by the stráží ghetta before 1944? “Pamatce na pobyt 1941–1944.” Translation from ghettospuren.de: “In remembrance of the stay 1941–1944.” Fischer remarks on the formality of the inscription, suggesting that this part of the south wall was created as “a collective place of remembrance.” The carving has been badly damaged since the first photos were made in 1945. A floral arrangement is the most elaborate of all the carvings. Fischer identifies the artist as Karel Russ, a shopkeeper in the Bohemian town of Kyšperk (now Letohrad). Fischer writes: “In the top center there is still a recognizable outline of the Star of David that was already removed in a rough manner in 1945.” For what it’s worth, I’m not so sure that’s not another flower. The deep hole in the middle was not present in 1945 and is not explained. Four caricatures of the same figure are lined up on a single sandstone block on the north wall, with a fifth squeezed into a narrow spot on the block below. Why the repetition? And who was the subject? The Italian legend “Il capitano della guardia” and the double stripe on the hat suggest a high-ranking Ghettowache official. Did he take these cartoonish portrayals with good humor? Or could the drawings possibly be selfies? The menorah at the bottom left of this panel is the only explicitly Jewish iconography I have spotted in these images. (As noted above, Fischer believes the floral panel included a Star of David.) As far as I can tell, there are no Hebrew inscriptions. Portraits of a man and a woman? That’s my best guess, but the carving is indistinct. The line above presumably reads “M.C. 1944,” but the “1″ has been gouged away. Not all of the inscriptions come from the Second World War. This one, signed “Alchuz Jan,” is dated August 6, 1911. Another (not shown) claims to be from 1871. It’s only to be expected that there are also later additions to the graffiti. Toward the bottom of this panel we have B.K. ♥ R.V. 1953. The white scrawl at top left is much more recent. On the other hand, the signature of “Waltuch Wilhelm” at upper right is from the war years. Fischer has identified him as the owner of a cinema in Vienna. Elsewhere he also signed his name in Cyrillic script.

I am curious about the chronology of the Ghettowache inscriptions. Are we seeing an accumulation of work carried out over a period of years, or was all the carving done in a few weeks or months? The preponderance of items dated 1944 argues for the latter view. In particular, the inscription “In remembrance of the stay 1941–1944” could not have been written before 1944, and it suggests some foreknowledge that the stay would soon be over.

A lot was going on at Terezín in 1944. In June, the camp was cleaned up for a stage-managed, sham inspection by the Red Cross; to reduce overcrowding in preparation for this event, part of the population was deported to Auschwitz. Later that summer, the SS produced a propaganda film portraying Theresienstadt as a pleasant retreat and retirement village for Jewish families; the film wasn’t really titled “The Führer Gives a Village to the Jews,” but it might as well have been. As soon as the filming was done, thousands more of the residents were sent to the death camps, including most of those who had acted in the movie. In the fall, with the war going badly for Germany, the SS decided to close the camp and transport everyone to the East. Perhaps that is when some of the inscriptions with a tone of finality were carved—but I’m only guessing about this.

As it happens, the liquidation of the ghetto was never completed, and in the spring of 1945 the flow of prisoners was reversed. Trains brought survivors back from the extermination camps in Poland, which were about to be overrun by the Red Army. When Terezín was liberated by the Soviets in early May, there were several thousand inmates. But the tunnel has no inscriptions dated 1945.

Graffiti is a varied genre. It encompasses scatological scribbling in the toilet stall, romantic declarations carved on tree trunks, the existential yawps of spray-paint taggers, dissident political slogans on city walls, religious ranting, sports fanaticism, and much else. It’s often provocative, sometimes indecent, imflammatory, insulting, or funny. The tunnel carvings at Terezín evoke a quite different set of adjectives: poignant, elegiac, calm, tender. It’s not surprising that we see no overtly political or accusatory statements—no strident “Let my people go,” no outing of torturers or collaborators. After all, these messages were written under the noses of a Nazi administration that wielded absolute and arbitrary power of life and death. Even so—even considering the circumstances—there’s an extraordinary emotional restraint on exhibit here.

What audience were the tunnel elegists addressing? I have to believe it was us, an unknown posterity who might wander by in some unimaginable future.

When Ros and I wandered by, the fact that we had discovered the place by pure chance, as if it were a treasure newly unearthed, made the experience all the more moving. Seeing the stones in a museum exhibit—curated, annotated, preserved—would have had less impact. Nevertheless, that is unquestionably where they belong. Uta Fischer and her colleagues are working to make that happen. I hope they succeed in time.

Posted in off-topic | 4 Comments

## Pumping the Primes

On your desktop is a black box. Actually it’s an orange box, because black boxes are usually painted “a highly visible vermilion colour known as international orange.” In any case, it’s an opaque box: You can’t see the whirling gears or the circuit boards or whatever else might be inside. There’s another, secret input: The tiny hole in the bottom right corner of the box. If you poke a paperclip into the hole, it will reset the mechanism.You have access only to the inputs and outputs. The input is a button labeled Next. The output is an adding machine tape.

Go ahead: Press the button. A number is printed on the tape. Press again and another number appears. Keep going. A few more. Notice anything special about those numbers? The sequence begins:

5, 3, 11, 3, 23, 3, 47, 3, 5, 3, 101, 3, 7, 11, 3, 13, 233, 3, 467, 3, 5, 3, . . .oeis.org/A137613

Yep, they’re all primes. They are not in canonical order, and some of them appear more than once, but every number in the list is certifiably indivisible by any number other than 1 and itself. Does the pattern continue? Yes, there’s a proof of that. Do all primes eventually find a place in the sequence? The very first prime, 2, is missing. Whether all odd primes eventually turn up remains a matter of conjecture. On the other hand, it’s been proved that infinitely many distinct primes are included.

So what’s inside the box? Here’s the JavaScript function that calculates the numbers printed on the tape. There’s not much to it:

  var n = 2, a = 7;   // initial values

function nextG() {
var g = gcd(n, a);
n = n + 1;
a = a + g;
return g;
}


The function gcd(n, a) computes the greatest common divisor of n and a. As it happens, gcd is not a built-in function in JavaScript, but there’s a very famous algorithm we can easily implement:

  function gcd(x, y) {
while (y > 0) {
var rem = x % y;    // remainder operator
x = y;
y = rem;
}
return x;
}


The value returned by nextG is not always a prime, but it’s always either $$1$$ or a prime. To see the primes alone, we can simply wrap nextG in a loop that filters out the $$1$$s. The following function is called every time you press the Next button on the orange black boxSee the rest of the source code on GitHub.:

  function nextPrime() {
var g;
do g = nextG() while (g === 1);     // skip 1s
return g;
}


For a clearer picture of where those primes (and $$1$$s) are coming from, it helps to tabulate the successive values of the three variables n, a, and g.

n    2   3   4   5   6   7   8   9  10   11   12   13   14   15   16   17   18   19   20   21   22   23
a    7   8   9  10  15  18  19  20  21   22   33   36   37   38   39   40   41   42   43   44   45   46
g    1   1   1   5   3   1   1   1   1   11    3    1    1    1    1    1    1    1    1    1    1   23


From the given initial values $$n = 2$$, $$a = 7$$, we first calculate $$g = \gcd(2, 7) = 1$$. Then $$n$$ and $$a$$ are updated: $$n = n + 1$$, $$a = a + g$$. On the next round the gcd operation again yields a $$1$$: $$g = \gcd(3, 8) = 1$$. But on the fourth iteration we finally get a prime: $$g = \gcd(5, 10) = 5$$. The assertion that $$g$$ is always either $$1$$ or a prime is equivalent to saying that $$n$$ and $$a$$ have at most one prime factor in common.

This curious generator of primes was discovered in 2003, during a summer school exploring Stephen Wolfram’s “New Kind of Science.” A group led by Matthew Frank investigated various nested recursions, including this one:

$a(n) = a(n-1) + gcd(n, a(n-1)).$

With the initial condition $$a(1) = 7$$, the sequence begins:

7, 8, 9, 10, 15, 18, 19, 20, 21, 22, 33, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 69, . . .oeis.org/A106108

Participants noticed that the sequence of first differences — $$a(n) - a(n-1)$$ — seemed to consist entirely of $$1$$s and primes:

1, 1, 1, 5, 3, 1, 1, 1, 1, 11, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 23, 3, . . .oeis.org/A132199

Stripping out the $$1$$s, the sequence of primes is the same as that generated by the orange black box:

5, 3, 11, 3, 23, 3, 47, 3, 5, 3, 101, 3, 7, 11, 3, 13, 233, 3, 467, 3, . . .oeis.org/A137613

During the summer school, Frank and his group computed 150 million elements of the sequence and observed no composite numbers, but their conjecture that the value is always $$1$$ or prime remained unproved. One of the students present that summer was Eric S. Rowland, who had just finished his undergraduate studies and was about undertake graduate work with Doron Zeilberger at Rutgers. In 2008 Rowland took another look at the gcd-based prime generator and proved the conjecture.

The sequence beginning with $$a(1) = 7$$ is not unique in this respect. Rowland’s proof applies to sequences with many other initial conditions as well—but not to all of them. For example, with the initial condition $$a(1) = 3765$$, the list of “primes” begins:

53, 5, 57, 5, 9, 13, 7, 71, 3, 41, 3, 4019, 3, 8039, . . .

Neither 57 nor 9 is a prime.

A number of other mathematicians have since elaborated on this work. Vladimir Shevelev gave an alternative proof and clarified the conditions that must be met for the proof to apply. Fernando Chamizo, Dulcinea Raboso, and Serafín Ruiz-Cabello showed that even if a sequence includes composites, there is a number $$k$$ beyond which all entries $$a(k)$$ are $$1$$ or prime. Benoit Cloitre explored several variations on the sequence, including one that depends on the least common multiple (lcm) rather than the greatest common factor; the lcm sequence is discussed further in a recent paper by Ruiz-Cabello.

Should we be surprised that a simple arithmetic procedure—two additions, a gcd, and an equality test—can pump out an endless stream of pure primality? I have been mulling over this question ever since I first heard about the Rowland sequence. I’m of two minds.

Part of the mystique of the primes is their unpredictability. We can estimate how many primes will be found in any given interval of the number line, and we can compile various other summary statistics, but no obvious rule or algorithm tells us exactly where the individual primes fall within the interval. For more on randomness and the primes, see the erudite essay of Terry Tao.There’s nothing random or indeterminate about the distribution of the primes, and yet, like a random sequence, the sequence of primes seems incompressible. The most concise way of describing the primes is just to list them. Thus a simple and compact formula that generates only primes seems a bit like magic.

But there’s another side to this story.See Underwood Dudley, “Formulas for Primes,” Mathematics Magazine, Vol. 56, No. 1 (Jan., 1983), pp. 17–22. It turns out the woods are full of prime-generating formulas. There’s this mysterious-looking formula for the nth prime, published in 1971 by J. M. Gandhi:

$p_n = \left\lfloor 1 - \log_2 \left( -\frac{1}{2} + \sum_{d|P_{n-1}} \frac{\mu(d)}{2^d - 1} \right) \right \rfloor.$

Here $$P_n$$ is the primorial product $$p_{1}p_{2}p_{3} \ldots p_{n}$$ and $$\mu$$ is the Möbius function. (If you don’t know what the Möbius function is or why you should care, Peter Sarnak explains it all.)

Way back in 1947, W. H. Mills offered a formula with just three symbols and a pair of floor brackets. He proved that a real number $$A$$ exists such that

$\left \lfloor A^{3^{n}}\right \rfloor$

is prime for all positive integers $$n$$. One possible value oeis.org/A051021 of $$A$$ lies somewhere in the neighborhood of 1.306377883863. The sequence of primes derived from that value begins:

2, 11, 1361, 2521008887, 16022236204009818131831320183, . . .oeis.org/A051254

A third example brings us back to the gcd function. For all $$n > 1$$, $$n$$ is prime if and only if $\gcd((n - 1)!, n) = 1.$

From this fact we can craft an algorithm that generates all the primes (and only the primes) in sequence.

The trouble with all these formulas is that they require prior knowledge of the primes, or else they have such knowledge already hidden away inside them. Solomon Golomb showed that Gandhi’s formula is just a disguised version of the sieve of Eratosthenes. The Mills formula requires us to calculate the constant $$A$$ to very high accuracy, and the only known way to do that is to work backward from knowledge of the primes. As for $$\gcd((n - 1)!, n) = 1$$, it’s really more of a joke than a formula; it just restates the definition that n is prime iff no integer greater than 1 divides it.

Underwood Dudley opined that formulas for the primes range “from worthless, to interesting, to astonishing.” That was back in 1983, before the Rowland sequence was known. Where shall we place this new formula on the Dudley spectrum?

Rowland argues that the sequence differs from the Gandhi and Mills formulas because it “is ‘naturally occurring’ in the sense that it was not constructed to generate primes but simply discovered to do so.” This statement is surely true historically. The group at the Wolfram summer school did not set out to find a prime generator but just stumbled upon it. However, perhaps the manner of discovery is not the most important criterion.

Let’s look again at what happens when the procedure NextG is invoked repeatedly, each time returning either $$1$$ or a prime.

n    2   3   4   5   6   7   8   9  10   11   12   13   14   15   16   17   18   19   20   21   22   23
a    7   8   9  10  15  18  19  20  21   22   33   36   37   38   39   40   41   42   43   44   45   46
g    1   1   1   5   3   1   1   1   1   11    3    1    1    1    1    1    1    1    1    1    1   23


In the $$g$$ row we find occasional primes or groups of consecutive primes separated by runs of $$1$$s. If the table were extended, some of these runs would become quite long. A good question to consider is how many times NextG must be called before you can expect to see a prime of a certain size—say, larger than 1,000,000. There’s an obvious lower bound. The value of $$gcd(n, a)$$ cannot be greater than either $$n$$ or $$a$$, and so you can’t possibly produce a prime greater than a million until $$n$$ is greater than a million. Since $$n$$ is incremented by $$1$$ on each call to NextG, at least a million iterations are needed. And that’s just a lower bound. As it happens, the Rowland sequence first produces a prime greater than 1,000,000 at $$n =$$ 3,891,298; the prime is 1,945,649.

The need to invoke NextG at least $$k$$ times to find a prime greater than $$k$$ means that the Rowland sequence is never going to be a magic charm for generating lots of big primes with little effort. As Rowland remarks, “a prime $$p$$ appears only after $$\frac{p - 3}{2}$$ consecutive $$1$$s, and indeed the primality of $$p$$ is being established essentially by trial division.”

Rowland also points out a shortcut, which is best explained by again printing out our table of successive $$n, a, g$$ values, with an extra row for some $$a - n$$ values:

n    2   3   4   5   6   7   8   9  10   11   12   13   14   15   16   17   18   19   20   21   22   23
a    7   8   9  10  15  18  19  20  21   22   33   36   37   38   39   40   41   42   43   44   45   46
g    1   1   1   5   3   1   1   1   1   11    3    1    1    1    1    1    1    1    1    1    1   23

a–n  5   5   5          11  11  11  11             23   23   23   23   23   23   23   23   23   23


Within each run of $$1$$s, $$a-n$$ is a constant—necessarily so, because both $$a$$ and $$n$$ are being incremented by $$1$$ on each iteration. What’s more, based on what we can see in this segment of the sequence, the value of $$a-n$$ during a run of $$1$$s is equal to the next value of $$n$$ that will yield a nontrivial gcd. This observation suggests a very simple way of skipping over all those annoying little $$1$$s. Whenever $$gcd(a, n)$$ delivers a $$1$$, set $$n$$ equal to $$a-n$$ and increment $$a$$ by $$a - 2n$$. Here are the first few values returned by this procedure:

5, 3, 11, 3, 23, 3, 47, 3, 95, 3, . . .

Uh oh. 95 is not my idea of a prime number. It turns out the shortcut only works when $$a-n$$ is prime. To repair the defect, we could apply a primality test to each value of $$a-n$$ before taking the shortcut. But if we’re going to build a primality test into our prime generator, we might as well use it directly to choose the primes.

It seems we are back where we began, and no closer to having a practical prime generator. Nevertheless, on Dudley’s scale I would not rank this idea as “worthless.” When you take a long hike around the shore of a lake, you eventually wind up exactly where you started, but that does not make the trip worthless. Along the way you may have seen something interesting, or even astonishing.

Posted in computing, mathematics | 11 Comments

## Gazing into the CodeMirror

Marijn Haverbeke is a programmer based in Berlin whose book Eloquent JavaScript is freely available on the Web and published in various other formats by No Starch Press. It’s a fine book, and I warmly recommend it. Buying a copy would help support the author, but even Haverbeke might concede that the online version offers the superior experience. On the web, the code comes alive. You can run a program and see its output immediately. You can edit the program directly in the web page. You can write and run your own programs. It’s all frictionless.

The technology behind this magic trick is a JavaScript component called CodeMirror, which is also Haverbeke’s creation. It is a code editor that can be embedded in any web page, providing all the little luxuries we’ve come to expect in a modern programming environment: automatic indentation and bracket matching, syntax coloring, auto­completion. The program and all of its many addons are free and open source. By now it is very widely used but not as widely noticed, because it tends to get buried in the infrastructure of other projects. I am writing this brief note to bring a little attention to an underappreciated gem.

Haverbeke’s Eloquent JavaScript is not the only online book that relies on Codemirror. Another one that I find very engaging is Probabilistic Models of Cognition, by Noah D. Goodman and Joshua B. Tenenbaum, which introduces the Church programming language. There’s also The Design and Implementation of Probabilistic Programming Languages, by Goodman and Andreas Stuhlmüller, which provides a similar introduction to a language called WebPPL. And the Interactive SICP brings in-the-page editing to the Sussman-and-Abelson Dragon Wizard Book.

But CodeMirror has spread far beyond these pedagogic projects. It is built into the developer tools of both the Chrome and the Firefox web browsers. It is the editor module for the IPython notebook interface (a.k.a. Jupyter), which is also used by the Sage mathematics system. It’s in both Brackets and Light Table, two newish open-source code editors (which run as desktop applications rather than in a browser window). You’ll also find CodeMirror working behind the scenes at Bitbucket, JSFiddle, Paper.js, and close to 150 other placed.

I had only a vague awareness of CodeMirror until a few months ago, when the New England Science Writers put on a workshop for science journalists, Telling Science Stories with Code and Data. As part of that project I wrote an online tutorial, JavaScript in a Jiffy. CodeMirror was the obvious tool for the job, and it turned out to be a pleasure to work with. A single statement converts an ordinary textarea HTML element into fully equipped editor panel. Any text entered in that panel will automatically be styled appropriately for the selected programming language. The machinery allowing the user to run the code is almost as simple: Grab the current content of the editor panel, wrap it in a <script>...</script> tag, and append the resulting element to the end of the document. (Admittedly, this process would be messier with any language other than JavaScript.)

Just a screenshot, regrettably. You’ll need to go to JavaScript in a Jiffy to edit and run the code.

The trickiest part of the project was figuring out how to handle the output of the programs written in CodeMirror panels. Initially I thought it would be best to just use the browser’s JavaScript console, sending textual output as a series of console.log messages. This plan has the advantage of verisimilitude: If you’re actually going to create JavaScript programs, the console is where test results and other diagnostic information get dumped. You need to get used to it. But some of the workshop participants found the rigmarole of opening the browser’s developer tools cumbersome and confusing. So I went back and created pop-up panels within the page to display the output. (It still goes to the console as well.)

A project like this would have been beyond my abilities if I had had to build all the machinery myself. Having free access to such elegant and powerful tools leaves me with the dizzy sensation that I have stumbled into an Emerald City where the streets are paved with jewels. It’s not just that someone has taken the trouble to create a marvel like CodeMirror. They have also chosen to make it available to all of us. And of course Haverbeke is not alone in this; there’s a huge community of talented programmers, fiercely competing with one another to give away marvels of ingenuity. Who’d’ve thunk it?

Posted in computing | 2 Comments

## AIpocalypse

Murder by Robot in R.U.R. (Image from Wikipedia.)The robot that runs amok and turns on its maker has been a staple of fiction and film for at least a century. The plotline goes back to Karel ?apek’s 1921 play R.U.R., with earlier shadows of the same idea in Mary Shelley’s Frankenstein and the golem stories of Jewish folklore. Nowadays we have Arnold Schwarzenegger dressed up as The Terminator.

A number of thoughtful people (including Stephen Hawking, Nick Bostrom, Bill Gates) believe we should take the threat of AI insurrection seriously. They argue that in decades to come we could very well create some sort of conscious entity that might decide the planet would be a nicer place without us.

In the meantime there are lesser but more urgent threats—machines that would not exterminate our species but might make our lives a lot less fun. An open letter released earlier this week, at the International Joint Conference on AI, calls for an international ban on autonomous weapon systems.

Autonomous weapons select and engage targets without human intervention. They might include, for example, armed quadcopters that can search for and eliminate people meeting certain pre-defined criteria, but do not include cruise missiles or remotely piloted drones for which humans make all targeting decisions. Artificial Intelligence (AI) technology has reached a point where the deployment of such systems is — practically if not legally — feasible within years, not decades, and the stakes are high: autonomous weapons have been described as the third revolution in warfare, after gunpowder and nuclear arms.

When I last checked, the letter had 2,414 signers who identify themselves as AI/robotics researchers, and 14,078 other endorsers. I’ve added my name to the latter list.

A United Nations declaration, or even a multilateral treaty, is not going to totally prevent the development and use of such weapons. The underlying technologies are too readily accessible. The self-driving car that can deliver the kids to soccer practice can also deliver a bomb. The chip inside a digital camera that recognizes a smiling face and automatically trips the shutter might also recognize a soldier and pull the trigger. As the open letter points out:

Unlike nuclear weapons, they require no costly or hard-to-obtain raw materials, so they will become ubiquitous and cheap for all significant military powers to mass-produce. It will only be a matter of time until they appear on the black market and in the hands of terrorists, dictators wishing to better control their populace, warlords wishing to perpetrate ethnic cleansing, etc.

What would Isaac Asimov say about all this?

I was lucky enough to meet Asimov, though only once, and late in his life. He was in a hospital bed, recovering from heart surgery. He handed me his business card:

ISAAC ASIMOV

Natural Resource

No false modesty in this guy. But despite this braggadocio, he could equally well have handed out cards reading:

ISAAC ASIMOV

Gentle Soul

Asimov was a Humanist with a capital H, and he endowed the robots in his stories with humanistic ethics. They were the very opposite of killer machines. Their platinum-iridium positronic brains were hard-wired with rules that forbade harming people, and they would intervene to prevent people from harming people. Several of the stories describe robots struggling with moral dilemmas as they try to reconcile conflicts in the Three Laws of Robotics.

Asimov wanted to believe that when technology finally caught up with science fiction, all sentient robots and other artificial minds would be equipped with some version of his three laws. The trouble is, we seem to be stuck at a dangerous intermediate point along the path to such sentient beings. We know how to build machines capable of performing autonomous actions—perhaps including lethal actions—but we don’t yet know how to build machines capable of assuming moral responsibility for their actions. We can teach a robot to shoot, but not to understand what it means to kill.

Ever since the 1950s, much work on artificial intelligence and robotics has been funded by military agencies. The early money came from the Office of Naval Research (ONR) and from ARPA, which is now DARPA, the Defense Advanced Research Projects Agency. Military support continues today; witness the recently concluded DARPA Robotics Challenge. As far as I know, none of the projects currently under way in the U.S. aims to produce a “weaponized robot.” On the other hand, as far as I know, that goal has never been renounced either.

Posted in computing, modern life | 7 Comments

## Meet the Julians

JuliaCon, the annual gathering of the community focused on the Julia programming language, convened last week at MIT. I hung out there for a couple of days, learned a lot, and had a great time. I want to report back with a few words about the Julia language and a few more about the Julia community.

It’s remarkable that after 70 years of programming language development, we’re still busily exploring all the nooks and crannies of the design space. The last time I looked at the numbers, there were at least 2,500 programming languages, and maybe 8,500. But it seems there’s always room for one more. The slide at left (from a talk by David Beach of Numerica) sums up the case for Julia: We need something easier than C++, faster than Python, freer than Matlab, and newer than Fortran. Needless to say, the consensus at this meeting was that Julia is the answer.

Where does Julia fit in among all those older languages? The surface syntax of Julia code looks vaguely Pythonic, turning its back on the fussy punctuation of the C family. Other telltale traits suggest a heritage in Fortran and Matlab; for example, arrays are indexed starting with 1 rather than 0, and they are stored in column-major order. And there’s a strong suite of tools for working with matrices and other elements of linear algebra, appealing to numericists. Looking a little deeper, some of the most distinctive features of Julia have no strong connection with any of the languages mentioned in Beach’s slide. In particular, Julia relies heavily on generic functions, which came out of the Lisp world. (Roughly speaking, a generic function is a collection of methods, like the methods of an object in Smalltalk, but without the object.)

Perhaps a snippet of code is a better way to describe the language than all these comparisons. Here’s a fibonacci function:

function fib(n)
a = b = BigInt(1)
for i in 1:n
a, b = b, a+b
end
return a
end

Note the syntax of the for loop, which is similar to Python’s for i in range(n):, and very different from C’s for (var i=0; i<n; i++). But Julia dispenses with Python’s colons, instead marking the end of a code block. And indentation is strictly for human readers; it doesn’t determine program meaning, as it does in Python.

For a language that emphasizes matrix operations, maybe this version of the fibonacci function would be considered more idiomatic:

function fibmat(n)
a = BigInt[1 1; 1 0]
return (a^n)[1, 2]
end

What’s happening here, in case it’s not obvious, is that we’re taking the nth power of the matrix $\begin{bmatrix}1& 1\\1& 0\end{bmatrix}$ and returning the lower left element of the product, which is equal to the nth fibonacci number. The matrix-power version is 25 times faster than the loop version.

@time fib(10000)
elapsed time: 0.007243102 seconds (4859088 bytes allocated)

@time fibmat(10000)
elapsed time: 0.000265076 seconds (43608 bytes allocated)

Julia’s base language has quite a rich assortment of built-in functions, but there are also 600+ registered packages that extend the language in ways large and small, as well as a package manager to automate their installation. The entire Julia ecosystem is open source and managed through GitHub.

When it comes to programming environments, Julia offers something for everybody. You can use a traditional edit-compile-run cycle; there’s a REPL that runs in a terminal window; and there’s a lightweight IDE called Juno. But my favorite is the IPython/Jupyter notebook interface, which works just as smoothly for Julia as it does for Python. (With a cloud service called JuliaBox, you can run Julia in a browser window without installing anything.)

I’ve been following the Julia movement for a couple of years, but last week’s meeting was my first exposure to the community of Julia developers. Immediate impression: Youth! It’s not just that I was the oldest person in the room; I’m used to that. It’s how much older. Keno Fischer is now an undergrad at Harvard, but he was still in high school when he wrote the Julia REPL. Zachary Yedidia, who demoed an amazing package for physics-based simulations and animations, has not yet finished high school. Several other speakers were grad students. Even the suits in attendance—a couple of hedge fund managers whose firm helped fund the event—were in jeans with shirt tails untucked.

Four of the ringleaders of the Julia movement. From left: Stefan Karpinski, Viral B. Shah, Jeff Bezanson, Keno Fischer.

Second observation: These kids are having fun! They have a project they believe in; they’re zealous and enthusiastic; they’re talented enough to build whatever they want and make it work. And the world is paying attention. Everybody gets to be a superhero.

By now we’re well into the second generation of the free software movement, and although the underlying principles haven’t really changed, the vibe is different. Early on, when GNU was getting started, and then Linux, and projects like OpenOffice, the primary goal was access to source code, so that you could know what a program was doing, fix it if it broke, customize it to meet your needs, and take it with you when you moved to new hardware. Within the open-source community, that much is taken for granted now, but serious hackers want more. The game is not just to control your own copy of a program but to earn influence over the direction of the project as a whole. To put it in GitHub terminology, it’s not enough to be able to clone or fork the repo, and thereby get a private copy; you want the owners of the repo to accept your pull requests, and merge your own work into the main branch of development.

GitHub itself may have a lot to do with the emergence of this mode of collective work. It puts everything out in public—not just the code but also discussions among programmers and a detailed record of who did what. And it provides a simple mechanism for anyone to propose an addition or improvement. Earlier open-source projects tended to put a little more friction into the process of becoming a contributor.

In any case, I am fascinated by the social structure of the communities that form around certain GitHub projects. There’s a delicate balance between collaboration (everybody wants to advance the common cause) and competition (everyone wants to move up the list of contributors, ranked by number of commits to the code base). Maintaining that balance is also a delicate task. The health of the enterprise depends on attracting brilliant and creative people, and persuading them to freely contribute their work. But brilliant creative people bring ideas and agendas of their own.

The kind of exuberance I witnessed at JuliaCon last week can’t last forever. That’s sad, but there’s no helping it. One reason we have those 2,500 (or 8,500) programming languages is that it’s a lot more fun to invent a new one than it is to maintain a mature one. Julia is still six tenths of a version number short of 1.0, with lots of new territory to explore; I plan to enjoy it while I can.

Quick notes on a few of the talks at the conference.

Zenna Tavares described sigma.jl, a Julia package for probabilistic programming—another hot topic I’m trying to catch up with. Probabilistic programming languages try to automate the process of statistical modeling and inference, which means they need to build things like Markov chain Monte Carlo solvers into the infrastructure of the programming language. Tavares’s language also has a SAT solver built in.

Chiyuan Zhang gave us mocha.jl, a deep-learning/neural-network package inspired by the C++ framework Caffe. Watching the demo, I had the feeling I might actually be able to set up my own multilayer neural network on my kitchen table, but I haven’t put that feeling to the test yet.

Finally, because of parallel sessions I missed the first half of Pontus Stenetorp’s talk on “Suitably Naming a Child with Multiple Nationalities using Julia.” I got there just in time for the big unveiling. I was sure the chosen name would turn out to be “Julia.” But it turns out the top three names for the offspring of Swedish and Japanese parents is:

Steneport wants to extend his algorithm to more language pairs. And he also needs to tell his spouse about the results of this work.

Posted in computing | 3 Comments

## Fast money, slow data

The other day I bought lunch from a food truck on the Harvard campus, paying with a debit card that my server swiped through one of those little plastic doodads attached to an iPhone. Ten minutes later, while I ate my báhn mì in a shady corner of Harvard Yard, I opened my laptop and logged on to my bank’s website. The $7 transaction had already been posted to my account. So here’s a question to think about: If the credit card networks and the banks can track the movement of money minute by minute, why does it take months to calculate the overall level of economic activity in the U.S.? The Department of Commerce reports on gross domestic product (GDP) at quarterly intervals. This number (or its first derivative) is often cited as a kind of benchmark of economic wellbeing. The initial announcement comes a month after the close of each quarter; for Q1 2015, the news was released April 29. That news wasn’t cheering: a measly 0.2 percent rate of growth. The New York Times coverage began, “Repeating an all-too-familiar pattern, the American economy slowed to a crawl in the first quarter of 2015….” But that wasn’t the last word on the first quarter. At the end of May, a revised report came out, saying the winter months had been even bleaker than first thought: not +0.2 percent but –0.7 percent. Then yesterday a third and “final” estimate was released (PDF). It was a Goldilocks number, –0.2 percent, pretty near the mean of the two earlier figures. Said the Times: “While nothing to brag about, the economy’s performance in early 2015 was not quite as bad as the number-crunchers in Washington had thought.” If we take it for granted that the end-of-June GDP estimate is somewhere near correct, then the first two reports were worse than useless; they were misleading. Taking action on the basis of those numbers—making an investment, say, or setting an interest rate—would have been foolish. It seems that if you want to know how the economy is doing in the first quarter, you have to wait until the second quarter ends. And of course we’re about to repeat the cycle. How did business fare this spring? Check back at the end of September, when my$7 food-truck sandwich may finally register in the government’s books.

Why so slow? To answer this, I thought I ought to learn a little something about how the Department of Commerce calculates GDP. Here’s where I learned that little something:

Landefeld, J. Steven, Eugene P. Seskin, and Barbara M. Fraumeni. 2008. Taking the pulse of the economy: Measuring GDP. Journal of Economic Perspectives 22(2):193–216. PDF.

It’s worse than I had guessed. The baseline for all these “national accounts estimates” is an economic census conducted every five years (in years congruent to 2 mod 5). For the 19 quarters between one economic census and the next, numbers are filled in by extrapolation, guided by a miscellany of other data sources. Landefeld et al. write:

The challenge lies in developing a framework and methods that take these economic census data and combine them using a mosaic of monthly, quarterly, and annual economic indicators to produce quarterly and annual GDP estimates. For example, one problem is that the other economic indicators that are used to extrapolate GDP in between the five-year economic census data—such as retail sales, housing starts, and manufacturers shipments of capital goods—are often collected for purposes other than estimating GDP and may embody definitions that differ from those used in the national accounts. Another problem is some data are simply not available for the earlier estimates. For the initial monthly estimates of quarterly GDP, data on about 25 percent of GDP— especially in the service sector—are not available, and so these sectors of the economy are estimated based on past trends and whatever related data are available. For example, estimates of consumer spending for electricity and gas are extrapolated using past heating and cooling degree data and the actual temperatures, while spending for medical care, education, and welfare services are extrapolated using employment, hours, and earnings data for these sectors from the Bureau of Labor Statistics.

Is this methodology really the best way to measure the state of the economy in the age of Big Data?

Actually, there’s a lot to be said for the quinquennial economic census. It goes to a large sample (four million businesses), and the companies are compelled by law to respond, which mitigates selection bias. The data series goes back to 1934; maintaining continuity with past measurements is valuable. Furthermore, the census and other survey-based instruments probe much more than just transactional data. They try to quantify what a company manufactures (or what services it provides), what labor and material inputs are consumed, and where the company’s revenue comes from. The analysis includes not just income and expenditures but also depreciation and amortization and other scary abstractions from the world of accounting. You can’t get all that detail just by following the money as it sloshes around the banking system.

Still, can we afford to wait five years between surveys? Three months before we get a reliable (?) guess about what happened in the previous quarter? Consider the predicament of the Federal Reserve Board, trying to walk the narrow line between encouraging economic growth and managing inflation. This is a problem in control theory, where delayed feedbacks lead to disastrous instabilities. (Presumably Janet Yellin and her colleagues have access to more timely data than I do. I hope so.)

Could we really create an up-to-the-minute measure of national economic health by mining credit card data, bank accounts, supermarket inventories, and food-truck receipts? I really don’t know. But I’ll quote a 2014 Science review by Liran Einav and Jonathan Levin:

Whereas the popular press has focused on the vast amount of information collected by Internet companies such as Google, Amazon, and Facebook, firms in every sector of the economy now routinely collect and aggregate data on their customers and their internal businesses. Banks, credit card companies, and insurers collect detailed data on household and business financial interactions. Retailers such as Walmart and Target collect data on consumer spending, wholesale prices, and inventories. Private companies that specialize in data aggregation, such as credit bureaus or marketing companies such as Acxiom, are assembling rich individual-level data on virtually every household….

One potential application of private sector data is to create statistics on aggregate economic activity that can be used to track the economy or as inputs to other research. Already the payroll service company ADP publishes monthly employment statistics in advance of the Bureau of Labor Statistics, MasterCard makes available retail sales numbers, and Zillow generates house price indices at the county level. These data may be less definitive than the eventual government statistics, but in principle they can be provided faster and perhaps at a more granular level, making them useful complements to traditional economic statistics.

I suppose I should also mention worries about giving government agencies access to so much personal financial data. But that horse is out of the barn.

Posted in social science, statistics | 5 Comments