Archive for April, 2006

Setting the river straight

Saturday, April 29th, 2006

A post here a few weeks ago discussed the late Luna B. Leopold and his work on the shapes of rivers. I mentioned in that item that Leopold collected data on a small stream called Watts Branch, near Washington, DC. The statement was correct, but I identified the wrong Watts Branch. There are two of them in the Washington area. The one that Leopold studied is not the one east of the city but the one to the northwest, flowing through Rockville, Maryland. This weekend, with the help of Andrew J. Miller of the University of Maryland Baltimore County, who worked with Leopold on Watts Branch, I found the right stream. Indeed, I was able to tramp along the section of Watts Branch where Leopold and his colleagues made their observations.

Watts Branch stream, Rockville, MD

The photo above shows an area just upstream of where Leopold measured the cross section of the channel and measured the flow. The area was once a cow pasture but is now a floodplain greenway entirely surrounded by suburban housing developments. The boulders emplaced on the bank are part of a project meant to stabilize the path of the stream, which has wandered extensively over the floodplain in the past few decades.

A further note: On April 26 Leopold was posthumously awarded the 2006 Benjamin Franklin Medal in Earth and Environmental Science, jointly with his long-time colleague M. Gordon Wolman of the Johns Hopkins University.

Talkin’ Towson

Monday, April 24th, 2006

I’ll be giving a talk at Towson University later this week. Details here. If you’re in the Baltimore area, please come by.

PageRank for physicists

Thursday, April 20th, 2006

Scientists are selfless seekers after truth, unswayed by worldly emoluments, immune to the tawdry enticements of fame, indifferent to prizes and honors. Thus I can’t quite imagine why anyone would bother ranking a collection of scientific papers by applying the algorithm that Google uses to decide which Web pages deserve the most prominent display. Nonetheless, if you’re an author of anything published in Physical Review or its various offshoots over the past century or so, P. Chen, H. Xie, S. Maslov and S. Redner have calculated the Google PageRank of your publications. The database they analyzed consists of 353,268 papers published in various American Physical Society journals between 1893 and 2003.

The best-known measures of “impact” for scholarly publications are based on counting citations; a paper’s impact rises when other papers refer to it. The PageRank algorithm applies this principle recursively. If paper A is cited by paper B, that fact will be weighted more heavily if paper B itself gets many citations. Chen et al. adapted this method of evaluation to the 3,110,839 citations within their network of Physical Review articles. They found that most of the papers with a high PageRank score are well-known works that also get high marks under other kinds of citation analysis. But there were some surprises—papers they call scientific gems. An example is a 1980 article by H. Rosenstock and C. Marquardt titled “Cluster formation in two dimensional random walks: Application to photolysis of silver halides.” Only three other articles in the database cite Rosenstock and Marquardt, and thus it remains an obscure publication from the point of view of most citation analyses; but it rises to 85th place in the PageRank standings because one of those three citing papers is itself a highly rated work. (The latter paper, by T. Witten and L. Sander, is “Diffusion-Limited Aggregation, a Kinetic Critical Phenomenon,” with 680 citations as of June 2003. Chen et al. note: “The Witten and Sander article has only 10 references; thus a substantial fraction of its fame is exported to [Rosenstock and Marquardt] by the Google PageRank algorithm.)

Unfortunately, the database used by Chen, Xie, Maslov and Redner is not publically available, and so there is no convenient way for physicists to check their own PageRank standings. But, then again, physicists are so egoless, I’m sure they wouldn’t bother anyway.

See: Finding Scientific Gems with Google, on the arXiv.

Brackets

Monday, April 17th, 2006

A thought for the day:

Our federal income tax law defines the tax y to be paid in terms of the income x; it does so in a clumsy enough way by pasting several linear functions together, each valid in another interval or bracket of income. An archeologist who, five thousand years from now, shall unearth some of our income tax returns together with relics of engineering works and mathematical books, will probably date them a couple of centuries earlier, certainly before Galileo and Vieta.

—Hermann Weyl, “The Mathematical Way of Thinking,” 1940.

Daniel J. Velleman has a suggestion for the Internal Revenue Service (thanks to Barry Cipra for this pointer).

Summing up

Tuesday, April 11th, 2006

The new issue of American Scientist is up on the Web, and the print edition will soon be delighting readers everywhere. My Computing Science column in this issue of the magazine revisits a famous bit of mathematical folklore—the story about young Carl Friedrich Gauss confounding his boorish schoolmaster. In one popular version of the story, the teacher, whose name was Büttner, instructed his class to sum all the whole numbers from 1 through 100. Büttner expected an hour of peace and quiet while the pupils labored over this task, but Gauss almost instantly chalked a number on his slate: 5,050. Gauss had somehow derived the formula n(n+1)/2 for the sum of the integers from 1 through n.

Let us transpose this fable to a modern classroom, where laptop computers have replaced slates. The enlightened 21st-century Büttner would never dream of giving out such a dreary assignment as adding up a long series of numbers, but perhaps he might propose the following exercise: Write a program that for any given integer n returns the sum of the first n integers. The aim here is not just to get the arithmetic right but to produce the “best” program. And thus the question becomes: What does “best” mean in this context? In my column, I suggest that the dull, brute-force approach—just adding up the elements of the series, one by one—might actually be preferable to a program based on Gauss’s ingenious formula. My rationale for this assertion is that the straightforward program would be easier to get right and easier to generalize. But is it true?

Friends have been bugging me to give up my Lispish ways and do some programming in Python, so I’ll use that language for the examples given here. Here’s a Python version of Gauss’s trick:

def gauss(n):
    return n*(n+1)/2

Not much to it, I have to admit. And the only way I can imagine getting it wrong would be to leave out the parentheses, so that the function would return n2 + 1/2. (In Python, for what it’s worth, “1/2” is equal to 0, so the actual returned value would be simply n2.)

The brute-force summation could also be written as a Python two-liner:

def brute(n):
    return sum(range(1,n+1))

However, relying on Python’s built-in sum function is not quite fair here, in my view; it assumes the solution to the problem, without telling us anything about what’s going on behind the scenes. For all we know, sum could implement the n(n+1)/2 algorithm, in which case it would not be brutish at all. A more-perspicuous version of the brute function might look like this:

def brute(n):
    total = 0
    for k in range(1,n+1):
        total += k
    return total

Now it’s clear (isn’t it?) that we are looping through a series of numbers, adding each element of the series to a running total.

Which of these functions, gauss or brute, is the “better” program? Which one would deserve higher marks in Büttner’s class? As far as I can tell, they both return correct results for every positive integer n—at least until the memory capacity of the computer is exceeded. But the gauss version has an important advantage in efficiency: It is an O(1) program, meaning that the running time is essentially constant, regardless of the value of n, whereas the brute program is O(n), with running time proportional to n. The difference is undetectable for Büttner’s original problem of summing the numbers from 1 to 100, but if you’re interested in larger values of n, I would strongly recommend gauss(10**10) over brute(10**10). So score one for gauss.

Efficiency isn’t everything, however. My initial thought was that the brute program would be simpler or more straightforward—it would translate the definition of the problem directly into code—and thus would be the most reliable way to avoid errors. Now I’m not so sure. For one thing, I have to confess that I got the brute program wrong on my first try. In the for loop I wrote range(1,n), believing that this Python expression would return a list of the integers from 1 through n inclusive; in fact the list extends from 1 through n–1. I sincerely hope you will agree that my expectation in this case is perfectly sensible, and the language is doing something utterly goofy, but in the end that’s not the point. The uncomfortable fact is, regardless of programming-language quirks, off-by-one errors are a hazard whenever and however you write a loop.

Eschewing Python’s range feature doesn’t really solve the problem. Here is a version of brute with a more-explicit specification of the loop structure:

def brute(n):
    k = 1
    total = 0
    while k <= n:
        total += k
        k += 1
    return total

I’m going to assert that this is a correct program, but to verify my claim you may need to think carefully about the loop-termination condition (perhaps it should be k < n?) as well as the order of the two statements within the while loop. If anything, this version of the program offers more opportunities to go wrong.

All of these difficulties suggest that the gauss program may well be the simpler and more-reliable one after all, but I am going to persist in trying to defend the brute-force approach. For starters, how do we know that the gauss program is correct? It’s easy enough to verify that the Python code n*(n+1)/2 matches the mathematical expression n(n+1)/2, but how do we know the mathematics is correct—that the expression actually represents the sum of the natural numbers up to n? There are several different conjectures about how the schoolboy Gauss might have discovered the identity:

\\sum_{k=1}^{n}k = \\frac{n(n+1)}{2}.

(For a discussion of these speculations, see my column; for the texts of 100+ versions of the story, see this HTML document.) As it happens, there’s no doubt the formula is correct; an inductive proof is not hard to follow. Nevertheless, when you consider the derivation of the formula as part of the process of writing the program, the gauss procedure is no longer quite so quick and easy.

The outlook gets even cloudier when you consider variations and generalizations. Suppose Büttner assigns a followup problem: Write a program that for any positive integer n computes the sum of the squares of the integers from 1 through n. For all the dullards who wrote something like the brute routine for the first exercise, this will be trivial modification of the earlier code:

def brute_squares(n):
    total = 0
    for k in range(1,n+1):
        total += k*k
    return total

But what will Gauss do? Maybe he knows (or can derive on the spot) the identity:

\\sum_{k=1}^{n}k^2 = \\frac{n(n+1)(2n+1)}{6},

which leads immediately to the Python function:

def gauss_squares(n):
    return n*(n+1)*((2*n)+1)/6.

So far so good. But then Büttner can ask for the sum of the cubes of the first n numbers. Again, the looping, iterative program is easy to modify; it’s just a matter of replacing k*k with k*k*k or k**3 or pow(k,3). Young Gauss has a harder task; he needs yet another flash of insight to come up with the nonobvious identity:

\\sum_{k=1}^{n}k^3 = \\left(\\sum_{k=1}^{n}k\\right)^2,

or in other words:

(1^3 + 2^3 + 3^3 + \\dots + n^3) = (1 + 2 + 3 + \\dots + n)^2.

The corresponding Python program,

def gauss_cubes(n):
    return gauss(n)**2

is wonderfully clear and succinct, but I think most of us would agree that the mathematical thought behind it is nontrivial.

With Carl Friedrich Gauss in the classroom, this sort of bantering competition between closed-form formulas and iterative computations could go on for some time, but I think Büttner could finally bring it to a close with this assignment: Write a program that accepts two arguments, namely a positive integer n and a function f that accepts an integer argument and returns a real result; the program must return the sum of f(1) through f(n). For the brutes among us, even this generalization of the original task is not much of a challenge. The program might look like this:

def brute_f(n,f):
    total = 0
    for k in range(1,n+1):
        total += f(k)
    return total

Invoking this program as brute_f(100,log) would calculate the sum of the natural logarithms of the first 100 integers, and brute_f(100,lambda(x):x**x) would yield the sum 11 + 22 + 33 + … + 100100. Perhaps Gauss could come up with a gauss_f program that would compute all such results in closed form—but I sure can’t.

At the end of the day, what’s the verdict on these two approaches to the summation process? I have to admit that my original intuition—my sense that the iterative program would be easier to write without errors—does not hold up very well to the test of experiment. But I am stubborn enough to go on defending the brute programs anyway. My new excuse is that they capture the abstraction, or the pattern, of summing a series in a way that the cleverer gauss programs do not. With explicit loops, you can see just what’s being computed; the formulas embedded in the various gauss programs seem opaque and mysterious by comparison. But maybe I’m just sore because I’m not sharp enough to invent those Gaussian formulas; in Büttner’s class, I know I’d be one of the other kids, the ones who add up all those numbers the painstaking and tedious way.

Reversing history

Sunday, April 9th, 2006

Several posts here (1, 2, 3, 4) as well as my March–April column in American Scientist have looked at technologies for building reversible computing machines. Most of the discussion focuses on “combinational” logic—networks of devices that have no feedback loops or memory elements. In a purely combinational circuit, the state of the outputs is determined entirely by the current state of the inputs. But real computers are not made up of purely combinational circuits. They have memory, and so the state of the outputs depends not just on the current inputs but also on earlier inputs. What happens to such history-dependent computational machinery when you try to throw it into reverse?

Peter G. Neumann of SRI International points out that this question was addressed by David A. Huffman of MIT almost 50 years ago, long before the better-known work on reversibility that I mentioned in my column. Huffman’s analysis was formulated in terms of finite-state machines, which are also called sequential machines because they pass through a sequence of states in response to a series of inputs. For example, one simple sequential logic machine has two states called odd and even, and it alternates between them as inputs are received. Huffman considered a subset of sequential machines that are “information lossless,” never discarding information as signals pass through the logic devices. For combinational circuits, “lossless” and “reversible” are equivalent concepts; if no information is thrown away, then the outputs determine the inputs just as much as the inputs dictate the outputs. For sequential devices, however, the situation is more complicated. Knowing the outputs of the machine is not always enough to recover the inputs, because information may also be sequestered in memory elements or the internal states of the machine. Hence there are lossless sequential circuits for which we cannot build an inverse circuit, because it would need to fetch information from the future in order to reconstruct the past. Nevertheless, Huffman showed there is always a procedure for reversing a lossless sequential machine, given enough knowledge of the internal states.

Huffman’s account of these ideas is worth revisiting after all these years. Here’s the reference: Huffman, David A. 1959. Canonical forms for information-lossless finite-state logical machines. IRE Transactions on Circuit Theory (special supplement) and IRE Transactions on Information Theory (special supplement) CT-6 and IT-5:41–59. The paper was reprinted in a volume that may be easier to track down, and that includes a number of other notable contributions to automata theory: Sequential Machines: Selected Papers, edited by Edward F. Moore, Reading, Mass.: Addison-Wesley, 1964.

One followup to Huffman’s work was a paper by Neumann himself, adapting the lossless sequential machine to the needs of error-correcting coding: Neumann, P. G. 1964. Error-limiting coding using information-lossless sequential machines. IEEE Transactions on Information Theory. IT-10:108–115.