The problem of describing trees

April 30th, 2008

When I finished writing about the Zeno wagering game recently, I had some trees left over, so I thought I would try planting them here.

In the Zeno article I was trying to understand and explain the structure of this peculiar-looking tree, which I’ll call the tangled tree:

zenotree.jpg

As a warmup exercise, I started out with this simpler example, the combed tree:

binarytree.jpg

These are both binary trees: Each node has exactly two descendants. And the trees also share a basic internal symmetry. In both structures, if a node has horizontal coordinate x, the two descendant nodes are symmetrically arrayed at x-d and x+d. The difference between the two trees is all in how the displacement distance d is calculated. In the combed tree, d has the initial value 1/4 at the root node, and it is reduced by half at each lower level of the tree. In the tangled tree, the formula is d = 1/2 min(x, 1-x). In other words, d is half the distance to the nearer boundary of the interval [0,1].

There was a third tree I didn’t have room to include in the article. I had stumbled upon it when I made a programming error. This rogue structure looks like this, and I’m calling it the braided tree:

crosstree.jpg

The nature of my programming error is not hard to see. Intending to generate the tangled tree, I had meant to calculate the positions of the child nodes as follows:

      left = x - 1/2 min(x, 1 - x)
      right = x + 1/2 min(x, 1 - x)

But instead I carelessly wrote:

      left = x - 1/2 x
      right = x + 1/2 (1 - x)

Stupid goof, but the result is visually intriguing. And although the braided tree is a botched version of the tangled tree, it actually has a lot in common with the combed tree. In particular, both the combed and the braided trees have exactly the same set of nodes; the difference is that those nodes are wired differently.

One way to understand the difference is to look at the paths leading from the root to any given node. Each such path is a sequence of left and right turns. For example, in the combed tree you can get from the root to the node at x = 3/16 by turning left at the root, then left again at 1/4 and finally right at 1/8. In the braided tree the path to 3/16 goes right to 3/4 then left to 3/8 and left to 3/16. Paths of this kind can be encoded as binary numerals, letting a 0 stand for each left turn and a 1 for each right turn. In this scheme the two paths described above are 001 and 100.

In essence, the braided tree is a permutation of the combed tree. If you list all the nodes at some fixed level in the combed tree (say those three levels below the root, where all denominators are 16), then the corresponding paths are in numerical order:

    1/16    000    (0)
    3/16    001    (1)
    5/16    010    (2)
    7/16    011    (3)
    9/16    100    (4)
   11/16    101    (5)
   13/16    110    (6)
   15/16    111    (7)

Here’s the corresponding tabulation for the braided tree:

    1/16    000    (0)
    3/16    100    (4)
    5/16    010    (2)
    7/16    110    (6)
    9/16    001    (1)
   11/16    101    (5)
   13/16    011    (3)
   15/16    111    (7)

The order is “counting from the left,” incrementing the most significant bit first. I wonder what other permutations give interesting-looking trees.

Turning back to the tangled tree that began this whole investigation, I would argue that it looks tangled and unruly not so much because lots of edges cross (there are even more crossings in the braided tree) but because of many near collisions and coincidences. For example, just below the root of the tangled tree, the mirror-image paths <left-right-right> and <right-left-left> almost converge on the same value, but they miss by just a bit. We can fix this! All it takes is an adjustment in the formula for the displacement d. In the tangled tree that displacement is 1/2 min(x, 1-x). The constant 1/2 in this expression is quite arbitrary, and we can substitute any other value a, with 0 < a < 1.

So what is the magic value of a in the following tree, which I am inclined to call the crocheted tree?

phitree.jpg

Note: If the title of this post fails to ring a bell, see the poem by Robert Hass:

The gene pool threw up a wobbly stem
And the tree danced. No.
The tree capitalized.
No. There are limits to saying,
In language, what the tree did.

The temblor forecast

April 15th, 2008

From the Associated Press, via the New York Times:

LOS ANGELES (AP) — California faces an almost certain risk of being rocked by a strong earthquake by 2037, scientists said in the first statewide temblor forecast.

New calculations reveal there is a 99.7 percent chance a magnitude 6.7 quake or larger will strike in the next 30 years. The odds of such an event are higher in Southern California than Northern California, 97 percent versus 93 percent.

caquake.jpg

I read this report with a certain sense of wonder. What impressed me was not the prediction itself; it’s not the first time I’ve heard that the Big One is coming. What took me by surprise was the level of mathematical sophistication that we can now take for granted in readers of the morning newspaper. No more do we have to worry that people will add up 97 percent and 93 percent to get 190 percent. Evidently, we’ve reached a state of universal numeracy, where everyone knows how to combine probabilities, and there’s no need to explain the calculation. We don’t even need to remind anyone that when we compute 1 – (1 – p)(1 – q), or p + qpq, we are assuming that p and q represent probabilities of statistically independent events; everybody knows that. And everybody understands that in this context “a chance of a quake” really means “a chance of at least one quake.”

I guess the only place where we might still stumble is in actually doing the arithmetic. My calculator tells me the number is 99.8 percent, not 99.7.

A further note: The original report on which the news item is based leaves me even more perplexed. The probability model adopted in the forecast is explained as follows:

The simplest assumption is that earthquakes occur randomly in time at a constant rate; i.e., they obey Poisson statistics. This model, which is used in constructing the national seismic hazard maps, is “time independent” in the sense that the probability of each earthquake rupture is completely independent of the timing of all others. Here we depart from the… conventions by considering “time-dependent” earthquake rupture forecasts that condition the event probabilities… on the date of the last major rupture. Such models… are motivated by the elastic rebound theory of the earthquake cycle…; they are based on stress-renewal models, in which probabilities drop immediately after a large earthquake releases tectonic stress on a fault and rise as the stress re-accumulates due to constant tectonic loading of the fault.

In other words, it doesn’t sound as though the assumption of independence is even approximately satisfied. I must be missing something. The 99.7 percent combined probability is mentioned in the executive summary of the report, but I found no explanation of how that number was calculated.

Perhaps I shouldn’t worry so much. I live thousands of kilometers away in a zone of seismic serenity.

Update, several hours later: After reading a little more carefully, I think the report does assume that all possible earthquake sites are independent. At each site the probability of an event is a function of time, but it is independent of probabilities at other sites. Thus calculating a joint probability for the northern and southern parts of the state does seem to be a valid operation. And the distinction between “exactly one” and “at least one” doesn’t really enter into the matter either. That’s because the model is only valid until the next major earthquake occurs; after that, all bets are off, since the time-dependent probabilities have to be recalculated.

If this interpretation of the model is correct, I think the way the result is expressed is somewhat misleading. To say there’s a 97 percent chance in Socal and a 93-percent chance in Nocal implies there’s a high probability (90.2 percent) of seeing both events in the course of the 30-year period. But the model is no longer valid after the first quake.

I wonder if there isn’t a better way to express the concept at the heart of this story. Qualitatively, it’s easy enough to grasp: In the next 30 years there will almost certainly be a major earthquake somewhere in California, and the event is more likely to happen in the southern part of the state than in the northern part. Putting this into numbers is somewhat tricky—or at least I’ve had a lot of trouble with it. Having finally surrendered to the computer and performed a Monte Carlo simulation, I come up with this statement: There’s a 99.8 percent chance that the next major California earthquake will happen by 2037. If indeed such a quake occurs, the odds are about 57 to 43 it will hit in Southern California.

In Zeno’s footsteps

April 10th, 2008

The latest issue of American Scientist is just out, both on the newsstand and on the web. My “Computing Science” column is titled “Wagering with Zeno”; it returns to a subject mentioned briefly in an earlier column, “Follow the Money.”

Consider a random walk on the interval (0,1), where the walker moves according to these rules:

  • The walker begins at x = 1/2.
  • At each step the walker moves either left (toward 0) or right (toward 1) with equal probability.
  • The length of each step is one-half the distance to 0 or to 1, whichever is nearer. In other words, the distance moved is 1/2 min(x, 1–x).

If you want to know more about this process and why I bothered to write about it, please see the column. Here I just want to say a few words about visual aids that might be helpful in understanding how the random walk evolves and where the walker is likely to wind up.

Here is one of the illustrations that appears in the column:

four trajectories of the Zeno walk, each plotted for 10 steps

We see four trajectories, each lasting 10 steps. Although the walk is actually one-dimensional—moving back and forth along an interval of the x axis—the trajectories are plotted in two dimensions for clarity, moving down the page so that the illustration becomes a kind of discrete spacetime diagram. Even with this device, though, some of the paths overlap for part of their course. If I tried to crowd more trajectories into the figure, the overlaps would make it hard to sort out one walk from another.

In making pictures like this, the issue is not just how you explain an experiment to other people; it’s also a matter of how you come to understand the process yourself. While I was trying to make sense of the Zeno walk, I plotted lots of trajectories like these, initially by hand and then with a program that generated Postscript files. Some important features were quickly apparent. In particular, I noticed that the walker takes much bigger strides in the middle of the space than near the edges. (Of course this fact follows directly from the definition of the walk, but it doesn’t hurt to see a picture.) I also observed that most of the walks seem to spend most of their time hugging one edge or the other, seldom venturing back across the midline of the space. Even after looking at a few hundred examples, however, I didn’t feel like I had a really secure sense of how a “typical” walk would proceed.

As with many processes that evolve over time, the illustrative possibilities are much richer if we can escape the static bounds of ink on paper. And thus I was led to try creating a more dynamic and interactive display.

screen image of zenowalk applet

What you see above is not dynamic; it’s merely static pixels on glass. But you can see it all in action in this Java applet. Feel free to click on the link and go play, but do come back here afterward so we can continue the conversation.

Hello again.

Does this program succeed in conveying an intuitive sense of how the Zeno walker walks? Personally, I’ve found it helpful, although it still comes up far short of ideal. Adjusting the number of steps in each walk gives a sense of how the walker’s apparent attraction to the edges gets more extreme over time. And adjusting the rate at which paths fade into the background allows for a degree of balance between clutter and evanescence. But there may well be a better way to go about visualizing these concepts. I’d be interested in hearing suggestions.

A note on the implementation: The applet was created with the programming language called Processing, invented by Ben Fry and Casey Reas. I’ve just reviewed two books on Processing, and so this project was undertaken partly to try out the language and its programming environment. I found it very slick, and mostly trouble-free. The source code of the zenowalk program is available through a link on the applet page. (Most of the code for the buttons and sliders was cribbed from the book by Fry and Reas.)

Of course there’s always something lacking in any programming language. In this case what I missed most acutely was the exact rational arithmetic I rely on in Lisp. Processing has only single-precision floating-point numbers. In the Zeno walk, these numbers begin running out of significant digits after about 150 steps (which is why I limited the program to 145 steps). As far as I can tell, no one has yet written a Processing library for bignums and exact rationals. I guess that’s my job.

Hard covers

April 3rd, 2008

My new book has come out this week: Group Theory in the Bedroom, and Other Mathematical Diversions, Hill and Wang, xi+269 pages, $25. ISBN-10: 0-8090-5219-9, ISBN-13: 978-0-8090-5219-9, Library of Congress Call Number: T185.H39 2008.

GTiBcover200.jpgThis is a collection of essays on themes that will be familiar to many readers of bit-player.org. Indeed, the essays themselves may be familiar! Eleven of the twelve essays appeared earlier as “Computing Science” columns in American Scientist; the twelfth was published in The Sciences. But there is some new material: Appended to each chapter is an “Afterthoughts” section where I confess to errors, castigate critics, bring outmoded notions up to date, and generally try to offer readers some excuse for selling them a $25 book, when they could find most of the content free on the web.

The book has its own handsome web site (I was lucky to snap up the domain name “grouptheoryinthebedroom.com” before some speculator squatted on it). I cordially invite all of you to go over there and have a look around. Meanwhile, back here at bit-player HQ, I’m going to throw myself a little party. See you in a day or two.

’Tis pleasant, sure, to see one’s name in print; A book’s a book, although there’s nothing in’t. —Lord Byron

The Lower 48 graph

March 14th, 2008

The interesting thread of comments on Wednesday’s post about “electoral hex” led me to look more closely at the graph for the Lower 48 states:

lower48.png

I’ve constructed the graph—and it wasn’t easy, by the way—on the rule that two states are linked by an edge if they share a common boundary of more than one point. National boundaries are determined in the same way. Nodes half-shaded orange are east coast or west coast states; those half-shaded blue are north coast or south coast states. Underwater boundaries count in this scheme. Thus, for example, Illinois is adjacent to Michigan because there’s a section of boundary line between them in the middle of Lake Michigan. Similarly, Wisconsin is not a north coast state, even though it abuts Lake Superior, because it has no shared boundary with Canada.

Some properties of the graph:

  • It has 48 vertices (no surprise!), 107 edges, and 60 faces (excluding the outer, embedding, face).
  • It’s a planar graph (again no surprise).
  • All but one of the faces are triangles. The exception is a quadrilateral formed by the Four Corners states.
  • The degree sequence runs: 8 8 7 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 2 2 2 2 1. (The two octopus states are Missouri and Tennessee. The singleton is Maine.)
  • The diameter is 11, and the characteristic path length (the length of the shortest path between two vertices, averaged over all pairs of vertices) is 4.1.

What about playing hex on this graph? In standard hex, one player tries to connect east to west, and the other player has to find a path from north to south. Those rules would hardly be fair with such an asymmetrical graph. So let’s adopt a modified version of the rules that I suggested for the political primaries: The players take turns, and the first to create a pathway either from north to south or from east to west is the winner. (Barry Cipra suggested that we call this the game of pox.)

Is the game a sure win for either the first or the second player (assuming correct play on both sides)? If so, what’s the right opening move?

It’s easy to find some really bad moves. As Paul Zorn noted, the New England states are not worth bothering with, because they’re sealed off by New York. But it turns out New York is not a very smart choice either. (If you take New York, I take Pennsylvania; then, if you want to get anywhere, you have to win all three of New Jersey, Delaware and Maryland, which gives me three chances to block you.)

The shortest bridging paths are along the I-5 and I-15 corridors, out west. In particular, there are five ways to construct a path of length 3 from north-to-south with the states WA, OR, CA, ID, NV, UT, AZ. But I don’t believe there’s any way a player can force a win using these states alone.

In hex, the best opening moves are generally near the middle of the board. The analogous pox strategy would choose one of the central and highly connected states such as MO, TN or KY. This seems sensible, but the games are too complicated to work out by hand.

On a regular 7 × 7 board, hex has been fully solved: The entire game tree is known for every possible opening move, so that each cell of the board is classified as a win or a loss for the first player. (This impressive feat of analysis and computation was done by R. Hayward, Y. Björnsson, M. Johanson, M. Kan, N. Po and J. van Rijswijck of the University of Alberta.) The Lower 48 graph is roughly the same size as the 7 × 7 hex board, so a similar method would presumably be feasible. Feasible but not easy.

Another question is whether the game can end without a winner. Jim Ward asked about this in the comments to the previous post, but I didn’t really understand the full force of the question. In the game of presidential primaries—where elections can occur simultaneously in multiple states—two or more winning paths could be created at the same time. Enforcing a strict alternating-turns policy, as I’m assuming here, eliminates that problem. But there’s another way the game might possibly fail to yield a winner, namely if all the states are chosen but there’s still no path along either axis.

In hex it’s easy to show that no such drawn game is possible. You can block the progress of a north-south path only by completing an east-west path. Hex is played on a regular and symmetrical lattice whose structure is equivalent to this graph:

hexfrag.png

The graph of the Lower 48 is not quite so uniform. The diagonals don’t all go the same way, and there’s that quadrilateral with no diagonal at the Four Corners. Does that feature (or some other oddity of the graph) allow all the states to be divided into two categories without creating either a north-south or an east-west path? I don’t think so; I haven’t been able to find such an assignment. But I’m not quite sure it can’t exist.

Update 2008-03-17: By popular request, I’ve posted a PDF version of the lower-48 graph and a text file with the vertex list and edge list. The vertex and edge lists actually take the form of a Scheme definition, but the semantics should be obvious and translation to other formats should be straightforward.

Electoral hex

March 12th, 2008

Democrats here in the U.S. are having quite a primary season. With high hopes of winning the general election in November, we are deadlocked over which candidate to nominate. Should we just wait to let the delegates slug it out at the convention in August? Should the trailing candidate withdraw in the interest of party unity? Could the two contenders flip a coin to decide who gets to head the ticket and who gets to drink the bucket of warm piss?

I have a proposal. Here’s the geography of the current situation, following Obama’s victory in Mississippi yesterday (map brazenly swiped from the New York Times):

primarymap.png

The solution is obvious, no? We follow the rules of the game of hex: The nomination goes to the first candidate to form a continuous chain of states either east to west (Atlantic to Pacific) or north to south (from the Canadian border to Mexico or the Gulf Coast). Thus the three key states are Pennsylvania (which votes April 22), North Carolina (May 6) and Kentucky (May 20). Clinton could create an east-west chain by taking North Carolina, or a north-south path by winning Kentucky. For Obama, Kentucky creates an east-west link; to construct a north-south chain, Obama would have to win both North Carolina and either Kentucky or Pennsylvania.

I get to cast my ballot on May 6 in North Carolina. All I’m saying for now is that I’m not voting for Ralph Nader.

How many Sudokus?

February 28th, 2008

The answer to the question in the headline is: Too many. After I wrote about Sudoku a couple a years ago, I thought I had cured my addiction; but I’ve been a shameless backslider.

I return to the subject now, in this public confessional mode, to correct an error in my 2006 column. Citing a paper by Bertram Felgenhauer and Fraser Jarvis (pdf), I stated that there are 6,670,903,752,021,072,936,960 Sudoku solutions. This is simply the number of ways of filling in the 81 squares without violating the no-two-alike constraints in rows, columns and blocks. I then went on to discuss the symmetries of the Sudoku grid, and noted:

When all these symmetries are taken into account, the number of essentially different Sudoku patterns is reduced substantially…. although [the reduction] still leaves an impressive number of genuinely different solutions: 3,546,146,300,288, or 4×1012.

The latter number also comes from Felgenhauer and Jarvis, but I misinterpreted it. Contrary to my statement, it takes into account only a few of the simplest symmetries. The true number of essentially distinct Sudoku solutions is reported in a slightly later publication by Ed Russell and Jarvis. The correct tally is smaller by three orders of magnitude: 5,472,730,538.

Thanks to Richard E. Dickerson for pointing out my error.

It needs to be emphasized that both of these calculations count the number of solutions, not the number of puzzles. For any given solution, there are many possible puzzles, each created by erasing a different subset of the entries in the solution. How many puzzles per solution? Does anyone know? An exact count seems beyond reckoning, but random sampling ought to yield a rough estimate. For example, consider just those puzzles that have 17 initial clues (the apparent minimum). For any given solution grid, there are (81 choose 17) ways of forming a 17-clue puzzle; this works out to 128,447,994,798,305,325. Most of those puzzles will not have a unique solution, and so they are disqualified as proper Sudokus. The random sampling is needed to estimate how many of the potential puzzles are admissible.

I’d try doing the calculation myself, but I’m too busy solving Sudokus.

Update 2008-04-05: Richard E. Dickerson discovered the error mentioned above in the course of preparing his own discourse on Sudoku-solving strategies. That work now has its own web site, The Sudokerson Matrix, where you can download the 61-page PDF file and an accompanying spreadsheet. Dickerson also mentions some other resources on Sudoku strategy, including a recent book by Andrew C. Stuart that I haven’t yet seen. Dickerson comments:

Few books on Sudoku go into strategies in any depth. Stuart’s The Logic of Sudoku (Michael Mepham, 2007) is a wonderful exception. But his book and mine represent the two opposite poles of Sudoku-solving strategy. I regard many of his methods as superfluous, and I am sure that he regards most of my strategies as heretical. You need to have a close look at both.

What I find intriguing in all this is the very idea that we could have contending doctrines of Sudoku strategy. But before the talk of “heresy” gets out of hand and leads to Sudoku holy wars, I’d like to remind everyone that there’s another view of Sudoku in which all strategies become superfluous. Give me a computer and a program for solving constraint-satisfaction problems, and every 9×9 Sudoku yields in milliseconds.

Veliagate

February 28th, 2008
Veliagatewide450.jpg

Zeno of Elea—the philosopher of footraces that never end and arrows that never reach their target—seems a figure so lost in abstractions and infinities that it’s hard to imagine him living in some particular place and time. But Elea was a real town, a sixth-century B.C.E. Greek settlement on the Tyrrhenian coast of Italy. The site is now the Italian village of Velia, about 100 kilometers south of Naples.

The massive stonework wall and gate shown above, known as the Porta Rosa, lies atop a ridge separating two neighborhoods of ancient Elea. I visited several years ago, and ever since then I’ve been perplexed by that gate. I’m now going to be writing about Zeno’s hometown for an upcoming American Scientist column, and so I thought I would ask the world for help in resolving the mystery of Veliagate.

What puzzles me, of course, is the eyebrow above the arch. It looks as if the gate was originally built a meter higher, and then lowered. Why would anyone do that? I can imagine how erosion of the ridge and the roadway might have made the opening taller than it was at the outset—but what’s the harm of that? And if the city engineers did want to maintain the original headroom, wouldn’t it be much easier to build up the roadway than to lower the arch?

Or am I wrong that the two arches represent two stages in the history of the gate? Maybe the “eyebrow” is really some sort of reinforcement?

I should mention that the Porta Rosa also has a bit of topological interest. It is built at a saddle point in the ridge. The lower passage (seen in the photograph) connects two valleys, while the roadway crossing above the arch links two hilltops.

Finally, I have another question about Greek mathematics. The big literary figures of ancient Greece were mostly Athenians, and so were the historians and political writers and the major philosophers. But when it comes to the nerdy Greeks, they seem to have been scattered all over the Mediterranean world. Euclid, Diophantus and Hypatia were Alexandrians; Pythagoras came from the island of Samos; Archimedes lived in Syracuse, on Sicily; and Zeno, as already noted, was Italian. Coincidence, or shall we speculate about the different intellectual environments of the center and the periphery?

The linguistic arrow of time

February 24th, 2008

Two recent notes on the Language Log, by Sally Thomason and Mark Liberman, discuss a nutty book, The Secret History of the English Language, by M. J. Harper. I haven’t read the book, but according to the Language Loggers, Harper contends that everybody has the history of European languages totally backwards. We’ve been taught that Latin gave rise to Italian, French, Spanish and the other Romance languages, and that English comes from Germanic roots with an important dash of Romance. The real chronology is just the opposite, Harper says. Liberman gives this precis:

[T]he history, according to Harper, is that English developed into French, which developed into Provençal, which developed into Italian; and then at some point, say around 400 B.C., some Italian merchants invented Latin as a form of shorthand.

I mention this curious thesis not because I believe anyone should take it seriously, or even because I want to defend it under the constitution’s Freedom of Wiftiness clause. But there’s an interesting mental exercise here: Can we refute this notion without resorting to mere dull historical facts? Suppose we had no documentary evidence bearing on the history of languages, and we ignored giveaways such as vocabulary items that betray their time of origin. From internal clues alone, could we deduce that Latin came before French or English? Without the labels, how would we know that Old English is older than Middle English, which in turn is older than modern English?

The challenge is rather like that of doing evolutionary biology without the fossil record. Could we look at the fishes, amphibians, reptiles, birds and mammals, and from their anatomy and physiology alone determine which groups arose earlier and which later? For the biological case, there’s a widely accepted premise that a trend toward increasing complexity defines the arrow of time. The vertebrate heart, for example, has two chambers in fishes, then three in amphibians and reptiles, and four in birds and mammals. Although there are exceptional cases where this kind of reasoning will lead you astray, it seems to work more often than not.

If there’s a similar principle in linguistics, however, I don’t know what it is. When it comes to grammatical complexity, the arrow of time seems to point the other way. Latin, for example, had a more elaborate system of inflection in nouns and adjectives than the languages descended from it. English went through a similar decline in declensions, losing case and gender markers on adjectives and abandoning its thee’s and thou’s. So maybe the rule is that simpler languages come later? But that can’t be universally true, unless we accept the implausible assumption that the very first languages were immensely complicated. Furthermore, if there is a monotonic trend toward simpler syntax, where are we headed?

Many linguists would dispute the assertion that languages show a consistent tendency to become either simpler or more complex. Yes, English has lost the word endings that once marked nouns as accusative, dative, instrumental, etc.; but in compensation it has acquired a more nuanced system of prepositions and stricter rules about word order. In this view languages do not evolve from some primitive state toward greater sophistication; nor, contrary to Miss Thistlebottom’s dire predictions, do languages degenerate into brutish grunts whenever someone splits an infinitive or dangles a participle. Changes in grammatical structure could be nothing more than a random walk through the space of all possible linguistic features. But if that’s the case, then there’s not much hope of finding an intrinsic marker of priority between pairs of languages. And so Latin really could have been invented by a bunch of Italian-speaking merchants.

Even if pairwise comparisons are problematic, though, perhaps we could find a “thermodynamic” arrow of time in the overall evolutionary pattern of a large family of languages. In biology, speciation is generally a one-way process: lines that diverge almost never reconverge. Fishes and finches have a common ancestor but they will have no common descendants. Thus the graph of relations among species is a tree, with a suppositional single root (the progenitor of all living things) and lots of leaves and branches, but no closed loops. If languages evolve in roughly the same way as living organisms, then we should be able to orient ourselves along the time axis by observing whether branches split or merge. By this argument, it’s far more likely that Latin underwent fission to produce Italian, Spanish, Catalan, French, etc., than that a dozen closely related Romance languages underwent fusion to create Latin.

There are at least two problems with this line of reasoning. First, although measures of lexical or grammatical similarity yield a tree of language relations, they don’t provide a sure-fire method of identifying the root of that tree. If you gather various words for ‘100′, you might construct a tree that includes this fragment:

kemtree.png

The conjectural k’mtom form is the reconstructed Proto-Indo-European root from which all the other terms—and many more—are thought to derive. The pattern of connections in the tree is based on judgments of lexical similarity: hundred is closer to hundert than it is to cento or cent. This linkage pattern is an invariant of the topology: It survives intact no matter how you choose to present the tree geometrically. But the identification of k’mtom as the root of the tree is not something that follows directly from a comparison of the words themselves. The diagram below shows exactly the same tree:

huntree.png

It has the same nodes and the same pattern of connections between them; the only thing that has changed is the choice of which node to designate as root. And if we knew nothing else about the chronology of the languages, there would be no obvious reason for preferring one layout over the other. (But notice that we can’t produce any arbitrary tree without doing violence to the network structure. In particular, Harper’s fantasy of going from English to French to Italian to Latin doesn’t work.)

The second problem with trying to derive a chronology from the language tree is that the tree isn’t a tree; it’s a DAG, a directed acyclic graph. Languages drift apart, but then they merge again. English is a prime example: Its deepest roots are in the West Germanic languages of the Angles, the Saxons and the Frisians, but English also received important later contributions from the Danish and the Norman French. Thus if we took seriously the idea that languages undergo fission but never fusion, we would have to conclude that English was the source of all those other languages rather than the product of their merger.

Perhaps I’m missing something important. Maybe there really is some intrinsic clue to the direction of language evolution—some way of looking at the internal structure of Latin and English and saying which came first. Even if not, though, I’m not buying the idea of English as Ursprache and Latin as shorthand Italian.

Update 2008-02-28: Richard E. Dickerson of UCLA alerts me to an earlier and perhaps even more florid bit of nuttery in the same genre. It’s a book published in 1883 by Charles Lassalle: Origin of the Western Nations & Languages Showing the Construction and Aim of Punic; Recovery of the Universal Language; Reconstruction of Phoenician Geography; Asiatic Source of the Dialects of Britain; Principal Emigrations from Asia; and Description of Scythian Society. With an Appendix, Upon the Connection of Assyrian with the Languages of Western Europe and Gaelic with the Languages of Scythia. This is one of those works where the title tells all (and then some), but the complete volume is available through Google Books, and I couldn’t resist having a look. Here’s how Lassalle begins his story (I would quote briefly if that were possible, but…):

HAVING made scientific discoveries which, on account of their great importance and extent, have not been accomplished without heavy sacrifices—having, in fact, abandoned my business to follow up with more freedom, ardour and unity of action, the Scents that had offered themselves to me when following in literary leisure certain historical and linguistic researches which seemed and have turned out to be of the utmost significancy; having also recognised that I must, for a time, entirely give myself up to the study of my discoveries, or I might never arrive at the solution that was looming before me at a distance; not knowing even where my task was leading me, and, therefore, not at liberty to form an opinion whether my work would occupy me a longer or a shorter time; having arrived at the conclusion of the task I had imposed upon myself, and been successful far beyond my ambition and expectations; having, moreover, been several times stimulated and sent to seek deeper into the channels of science by the incredulity I met from many, that a commercial man could be successful upon subjects which, until now, had baffled all the efforts of learned professors, though their common sense should have told them, that upon topics so simple and technical as those of history, geography, and languages, a travelled commercial man, acquainted with most of the Western languages and some of the old ones, had, at least, as much chance to arrive at a linguistic discovery, and enlarge it upon geographical and historical bearings which his personal experience permitted him to grasp, as a sedentary professor, who, though much versed in Greek and Latin, was generally not familiar with many of our commercial Western languages, and had not the opportunity of comparing the various customs and dialects which so often meet the eye and ear of a mercantile man.

HAVING now reached a period, though not yet a full sentence, I stop.

EATCS award to Valiant

February 15th, 2008

Leslie G. Valiant, whose work on holographic algorithms was the subject of a recent column in American Scientist and a brief note here on bit-player, has won the 2008 EATCS Award of the European Association for Theoretical Computer Science. In addition to the work on holographic algorithms, the EATCS cites Valiant’s contributions of computational learning theory, neuroscience, and several areas of complexity theory, including the study of enumeration problems.