Archive for the ‘biology’ Category

A molecular millisecond

Saturday, February 6th, 2010

It was not quite a century ago that we got our first glimpse of molecules. William Lawrence Bragg, with a little help from his dad, figured out how to get molecules to sit still long enough for a portrait. First you had to crystallize the substance, then shoot x-rays through the carefully mounted crystal, then record the lace-doily pattern of diffracted rays on photographic film.

After all that lab work came the really hard part: analyzing the pattern of bright dots in order to reconstruct the positions of atoms in three-dimensional space. This was a difficult inverse problem, something like deducing the shape of a musical instrument from the sounds it emits. (The EDSAC, the first working stored-program computer, was put to work deciphering x-ray diffraction patterns circa 1950.)

Finally it came time to build a model of the molecule–and in those days a model occupied physical rather than virtual reality. In the 1970s I visited Max Perutz, who had by then spent more than 30 years working out the structure of the hemoglobin molecule. His offices were cluttered with modeling artifacts: stacked sheets of transparent plastic, marked with hand-drawn contour maps of electron density, lumpy clay and plaster extrusions showing the overall form of the protein at low resolution, and the now-familiar tinker-toy assemblies of balls and sticks.

It was hard-won knowledge, and I thought it heroic science. I still do. And yet everyone knew all along–Perutz more emphatically than anyone else–that those rigid, static models of proteins were highly misleading. In the living cell, biological macromolecules do not sit immobile like bronze statues. They are machines with moving parts; they continually flex and wiggle, mesh and then disengage, spin, flap, bend, stretch; all day long they do a hyperkinetic hokey-pokey.

I have now seen a remarkable performance of that molecular dance. In a talk at Harvard earlier this week David E. Shaw showed two videos, each portraying about a millisecond in the life of a single protein molecule. A millisecond may not sound like much, but each video was created by computing atomic motions at roughly one step per femtosecond. That’s \(10^{12}\) steps in all. (If you included all the steps in a video, and displayed them at 60 frames per second, the show would go on for 500 years.)
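The arithmetic is easy to check. Here is a quick sketch, assuming exactly one femtosecond per step, one step per video frame and a 365.25-day year (these are my round numbers, not figures from the talk):

```python
# Back-of-envelope check of the step count and the playback time.
# Assumptions: exactly 1 fs per simulation step, one step per video frame,
# 60 frames per second, 365.25-day years.
FEMTOSECONDS_PER_MILLISECOND = 10**12    # 1e-3 s divided by 1e-15 s

steps = FEMTOSECONDS_PER_MILLISECOND     # one step per femtosecond
playback_seconds = steps / 60            # one step per frame at 60 fps
playback_years = playback_seconds / (365.25 * 24 * 3600)

print(f"{steps:.0e} steps, roughly {playback_years:.0f} years of video")
```

Under those assumptions the result lands a little above the 500 years quoted, which is close enough for a parenthetical remark.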

Shaw was once a computer scientist at Columbia, then he went off to make some billions on Wall Street. (He was introduced to the Harvard audience as “King Quant.”) He has now turned to computational molecular biology, setting up his own lab and building a series of special-purpose computers designed for molecular-dynamics simulations. The machines are called Anton, in honor of Leeuwenhoek. Shaw’s group has built eight of them so far, each with 512 processors. A kiloprocessor model is expected to come on line in a few weeks.

The basic idea behind the computations is simple. Start with the initial positions and velocities of all the atoms. Calculate the force that each atom exerts on every other atom, and the resulting acceleration. Use the accelerations to advance every atom by one small time step. Wash, rinse, repeat. For a system of N atoms, the naive version of this algorithm has a running time proportional to \(N^2\); this quadratic growth is a bit of a problem, because the model includes not only several hundred atoms in the protein itself but also up to 50,000 atoms in the surrounding solvent. So Anton takes some shortcuts. The big one is to do a full accounting of pairwise interactions only for atoms within a limited radius; the distribution of more-distant atoms is remapped to a mesh of discrete points. But even after this winnowing of the problem, the calculation of pairwise forces remains the principal bottleneck. It is solved by throwing hardware at it: 32 × 512 parallel pipelines implemented on custom silicon. There’s more on Anton’s architecture and algorithms here; the Shaw Research web site lists lots of other publications as well, but most of them are not accessible without payment.
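To make the loop concrete, here is a toy version in Python. It is only a sketch under my own assumptions: a made-up repulsive force law (the hypothetical `pair_force`), unit masses, a simple semi-implicit Euler update, and a hard cutoff radius with no mesh correction for the distant atoms. None of these details come from Anton, whose force fields and integrator are far more sophisticated.

```python
import math

def pair_force(r):
    """Hypothetical force magnitude at separation r (positive means repulsive)."""
    return 1.0 / r**2

def md_step(positions, velocities, dt, cutoff):
    """Advance all atoms by one time step using naive O(N^2) pairwise forces.

    positions and velocities are lists of [x, y, z] lists, updated in place.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):            # visit each pair exactly once
            delta = [positions[i][k] - positions[j][k] for k in range(3)]
            r = math.sqrt(sum(d * d for d in delta))
            if r == 0.0 or r > cutoff:       # ignore pairs beyond the cutoff radius
                continue
            f = pair_force(r)
            for k in range(3):
                push = f * delta[k] / r      # component along the line joining the pair
                forces[i][k] += push         # equal and opposite forces
                forces[j][k] -= push
    for i in range(n):
        for k in range(3):
            velocities[i][k] += forces[i][k] * dt     # unit mass: acceleration = force
            positions[i][k] += velocities[i][k] * dt  # move with the updated velocity
    return positions, velocities
```

Calling `md_step` a trillion times is the "wash, rinse, repeat" part. The double loop over `i` and `j` is where the quadratic cost comes from, and the cutoff test is a crude stand-in for the limited-radius shortcut.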

As far as I can tell, the videos of proteins in motion are not yet available anywhere, and that’s really too bad. They might well be the next dance sensation on YouTube. Watching them in the lecture hall, I was so bedazzled that I neglected to note the identity of the molecules. One was an ion channel, a protein that spans the width of a membrane and controls the passage of some specific ion (potassium, I think, in this case). We watched the six polypeptide strands twisting closed like the blades of a camera iris, shutting off the channel. Another simulation showed an even more dramatic reconfiguration. For many microseconds of biological time, and perhaps half a minute of wall-clock time, the protein sat nervously quivering and fidgeting, hunched up in a compact globule, with occasional minor adjustments to various loops and corners. And then suddenly the whole molecule opened up like a flower blooming; a moment later it closed again. If I understand correctly what Shaw was telling us, the existence of this alternative state had been known from experimental evidence, but the transformation had never been seen before. And, as he remarked early in the talk, “seeing what it looks like” brings a level of understanding that would be hard to achieve by more analytic methods.

Which brings me to my one gripe. The truth is, we still don’t know what a protein really looks like, and we never will, because “looking” is not a well-defined notion for objects smaller than the wavelength of light. Color, for example, is just not meaningful in this realm, and surface texture is also problematic. Thus schemes for depicting molecules are necessarily a matter of convention. It’s worth giving some careful thought to those conventions, choosing graphic forms that convey as much as possible about what we do know without inviting spurious inferences about what we don’t know (such as color and texture).

Some of Shaw’s illustrations use a ribbon-and-sheet scheme invented 30 years ago by Jane Richardson, which still seems to work well for showing the overall architecture of a protein. But other diagrams and videos use a ball-and-stick model to represent atomic detail, and this strikes me as a less-happy choice. Watching that jiggling assembly of balls and sticks (black for carbon, red for oxygen, etc.), I kept seeing a shiny, brittle, plastic model of a protein rather than the protein itself. Surely there are better graphic devices.

Update: Thanks to Ron Dror of D. E. Shaw Research for pointing out an error in my description of the Anton algorithm: Distant charges are not mapped to a continuum distribution but to a mesh of discrete points. (I’ve made a correction above.) Ron also notes that an article on Anton in Communications of the ACM is available in the CACM digital edition.

Flights of fancy

Tuesday, October 27th, 2009

starlings-closeup-2058.JPG

As I have mentioned in the past, I’m fascinated by the acrobatics of bird flocks, especially the big congregations of European starlings that gather in the evening at this time of year. Evidently I’m not the only one with such an interest. In the past few years the subject has attracted the attention of quite a large flock of scientists, including not only biologists but also various luminaries in physics, mathematics and computer science.

Below are some notes on a few of the recent papers, but first I have to mention a classic from 20 years ago:

Reynolds, Craig W. 1987. Flocks, herds, and schools: a distributed behavioral model. Computer Graphics 21(4):25–33. Author archive.

This is the paper that began the modern era of flocking studies by proposing that animals could coordinate and synchronize their movements without any need for a leader or external cues. Others were thinking along the same lines at about the same time, but it was Reynolds who attracted wide notice with his enchanting computer animations of “boids” soaring through an imaginary three-dimensional space. Each individual in the flock acts according to simple, local, fixed rules, and the synchronized maneuvers emerge spontaneously.

Reynolds suggested three particular rules that might guide the behavior of each bird:

  • Avoid collisions.
  • Try to match the speed and heading of nearby birds.
  • Move toward the center of the group in which you are flying.
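The three rules translate almost directly into code. What follows is a minimal two-dimensional sketch, not Reynolds’s own implementation; the neighborhood radius and the three weights are arbitrary values I have picked for illustration.

```python
import math

def boid_steer(me, flock, radius=5.0, w_avoid=1.0, w_align=0.1, w_center=0.01):
    """Return a steering vector (ax, ay) for one boid.

    Each boid is a dict with 'pos' and 'vel' as (x, y) tuples.
    radius and the w_* weights are illustrative assumptions.
    """
    near = [b for b in flock
            if b is not me and math.dist(b["pos"], me["pos"]) <= radius]
    if not near:
        return (0.0, 0.0)
    n = len(near)
    steer = [0.0, 0.0]
    for k in range(2):
        # Rule 1: avoid collisions by moving away from each neighbor,
        # more strongly the closer it is.
        avoid = sum((me["pos"][k] - b["pos"][k])
                    / max(math.dist(b["pos"], me["pos"]) ** 2, 1e-9)
                    for b in near)
        # Rule 2: match the average velocity of the neighbors.
        align = sum(b["vel"][k] for b in near) / n - me["vel"][k]
        # Rule 3: move toward the center of the neighbors.
        center = sum(b["pos"][k] for b in near) / n - me["pos"][k]
        steer[k] = w_avoid * avoid + w_align * align + w_center * center
    return tuple(steer)
```

Every boid runs the same function on its own local view of the flock; nothing in the code designates a leader.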

Reynolds was working in computer graphics, and his ideas were soon taken up by movie studios and by the makers of video games. In a sense, his simulations only had to look right; they didn’t have to reflect what actually goes on in a starling’s head. But whether or not the birds were paying attention, students of animal behavior certainly were.

starlings-wide-2064.jpg

Much of the recent activity arises out of new field studies, conducted mainly by physicists.

Cavagna, Andrea, Irene Giardina, Alberto Orlandi, Giorgio Parisi, Andrea Procaccini, Massimiliano Viale and Vladimir Zdravkovic. 2008. The STARFLAG handbook on collective animal behaviour. 1: Empirical methods. Animal Behaviour 76:217–236. Preprint.

Cavagna, Andrea, Irene Giardina, Alberto Orlandi, Giorgio Parisi and Andrea Procaccini. 2008. The STARFLAG handbook on collective animal behaviour. 2: Three-dimensional analysis. Animal Behaviour 76:237–248. Preprint.

This group, coordinated by Andrea Cavagna and Irene Giardina of the University of Rome La Sapienza, has been photographing starling flocks near the city’s main railroad station (the Termini), which is just a few blocks from the university. Using pairs of synchronized cameras, the observers have captured stereoscopic images and then applied special image-analysis software to reconstruct the three-dimensional trajectory of each bird. Similar techniques have been tried in the past, but only with small flocks (a few dozen birds). The Italian group has traced the motions of individual birds in groups of up to 2,600. The two papers cited above give technical details on how the data were gathered and analyzed.

Ballerini, Michele, Nicola Cabibbo, Raphael Candelier, Andrea Cavagna, Evaristo Cisbani, Irene Giardina, Alberto Orlandi, Giorgio Parisi, Andrea Procaccini, Massimiliano Viale and Vladimir Zdravkovic. 2008. Empirical investigation of starling flocks: a benchmark study in collective animal behaviour. Animal Behaviour 76:201–215. Preprint.

Ballerini, Michele, Nicola Cabibbo, Raphael Candelier, Andrea Cavagna, Evaristo Cisbani, Irene Giardina, Vivien Lecomte, Alberto Orlandi, Giorgio Parisi, Andrea Procaccini, Massimiliano Viale and Vladimir Zdravkovic. 2008. Interaction ruling animal collective behavior depends on topological rather than metric distance: Evidence from a field study. Proceedings of the National Academy of Sciences of the USA 105:1232–1237. Open access.

And here the same authors (with a few additions) report their results and conclusions. They base their interpretation on a computational model that is recognizably a descendant of the Reynolds scheme, but with one crucial modification. Reynolds and others assumed that each bird is influenced by all other birds within some fixed distance (a “metric neighborhood”); Ballerini et al. get a closer match to the data by assuming that a bird attends to the motions of a fixed number of near neighbors, regardless of distance (a “topological neighborhood”). In other words, the graph of interacting birds has nearly constant vertex degree; the typical degree is probably six or seven. The main significance of this algorithmic change is that it helps maintain the cohesion of the flock in spite of large variations in density.
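The difference between the two kinds of neighborhood is easy to state in code. This is a sketch of the two selection rules only, with function names of my own invention and a default of seven neighbors taken from the typical degree mentioned above; it is not the model used by Ballerini et al.

```python
import math

def metric_neighbors(i, positions, radius):
    """Indices of all birds within a fixed distance of bird i."""
    return [j for j, p in enumerate(positions)
            if j != i and math.dist(p, positions[i]) <= radius]

def topological_neighbors(i, positions, k=7):
    """Indices of the k nearest birds to bird i, however far away they are."""
    others = [j for j in range(len(positions)) if j != i]
    others.sort(key=lambda j: math.dist(positions[j], positions[i]))
    return others[:k]

# A thinly spread line of birds, 10 units apart.
sparse = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
print(metric_neighbors(0, sparse, radius=5.0))   # no one is within range
print(topological_neighbors(0, sparse, k=2))     # still finds two neighbors
```

When the flock thins out, the metric rule can leave a bird with no neighbors at all, whereas the topological rule always returns the same number of them. That is the cohesion argument in miniature.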

Hildenbrandt, Hanno, Claudio Carere and Charlotte K. Hemelrijk. 2009. Self-organised complex aerial displays of thousands of starlings: a model. arXiv:0908.2677v1

Those same flocks at Termini have a role in this study as well; the model presented here draws on data from Ballerini et al. as well as videotapes made at Termini by Carere. (Carere is another physicist at Sapienza; Hildenbrandt and Hemelrijk are biologists at the University of Groningen.)

The model works on the same essential principles, but it differs in intellectual style and emphasis. Hildenbrandt et al. want to account for specific details of a flock’s behavior—not just the general tendency to fly in close formation but also the particular shapes of starling flocks, the maneuvers they perform, the altitudes they prefer, and so on. Reaching for this verisimilitude leads to a rather complicated model with many parameters in need of fine tuning, such as aerodynamic properties of the bird’s wing and body and banking angles in turns. Hildenbrandt et al. report some success in explaining the geometry of flocks (they tend to be horizontally flattened rather than spherical). They do less well in an attempt to account for an extra-dense layer of birds observed at the periphery of a flock.

starlings-landing-2072.jpg

Cucker, Felipe, and Steve Smale. 2007. Emergent behavior in flocks. IEEE Transactions on Automatic Control 52:852–862.

Chazelle, Bernard. 2009. Natural algorithms. Proceedings of the 20th Symposium on Discrete Algorithms, pp. 422–431. Preprint.

Chazelle, Bernard. 2009. The convergence of bird flocking. arXiv:0905.4241v1

Leaving behind the breathy wing-beats of living starlings, we enter a world of mathematical abstractions.

Cucker and Smale, peripatetic mathematicians currently at the City University of Hong Kong, take a stripped-down model of flocking and ask this question: Is it guaranteed that all the birds in the flock will eventually settle on the same velocity, and thus fly together forever? Chazelle, a theoretical computer scientist at Princeton, asks a follow-on question: If the birds do converge on the same speed and heading, how long might it take for them to do so, in the worst case?

The answer to the Cucker-Smale question turns out to be yes: Given certain preconditions and parameter values, convergence is certain. But Chazelle shows that it can take quite a while for the flock to reach consensus. For n birds adjusting their velocities in discrete steps, the upper bound is 2 ↑↑ (4 log n) steps. As I was saying just the other day, this up-arrow notation denotes an exponential tower of 2s with, in this case, \(4 \log_2 n\) levels. In other words, in a flock of a thousand birds, the convergence time is roughly

\[2^{2^{2^{\cdot^{\cdot^{\cdot^2}}}}}\]

with 40 levels of exponentiation. This is a ridiculous number, far exceeding the lifetime of a starling (or of a universe, for that matter). As Chazelle notes: “Our bounds obviously say nothing about physical birds in the real world. They merely highlight the exotic behavior of the mathematical models.”
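For a sense of scale, the tower can be evaluated only for its first few levels. The sketch below is mine, not Chazelle’s; it assumes the logarithm is taken to base 2, and it reports sizes in bits because the numbers soon become too long to print.

```python
import math

def tower_of_twos(levels):
    """2 ↑↑ levels: a power tower of 2s with the given number of levels."""
    value = 1
    for _ in range(levels):
        value = 2 ** value
    return value

# Number of levels for a flock of 1,000 birds (just under 40).
levels_for_1000_birds = 4 * math.log2(1000)
print(f"levels for 1,000 birds: {levels_for_1000_birds:.1f}")

# Only the first five levels are small enough to compute directly.
for h in range(1, 6):
    print(f"2 ↑↑ {h} needs {tower_of_twos(h).bit_length()} bits")
```

Don’t try `tower_of_twos(6)`: its exponent is the five-level tower, a number of more than 65,000 bits, so the result could not be stored in any computer.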

It is rather wonderful to reflect—as you stand in a field of corn stubble admiring the flocks of birds wheeling overhead in the evening sky—that these avian entertainments should be the starting point for a line of reasoning that ventures so far into the wild blue yonder of inexpressible numbers.

Lebar Bajec, Iztok, and Frank H. Heppner. 2009. Organized flight in birds. Animal Behaviour 78:777–789. Preprint.

I mention this piece last, but it would actually be a good place to start if you want a primer on flocking. Frank Heppner, a biologist at the University of Rhode Island, is one of the pioneers of flocking-and-swarming studies; here, with a mathematical colleague from the University of Ljubljana, he reviews many of the recent contributions and puts them in historical context. The review includes a discussion of the more crystalline flying formations of large birds such as geese as well as the amorphous flocks of starlings.

Argiope aurantia

Wednesday, October 7th, 2009

Argiope-aurantia-2499.jpg

It’s orb-weaving season in my part of the world. Out in the ivy, I have four webs of the golden orb weaver, Argiope aurantia, all within one square meter.

The engineering talents of all the orb weavers are impressive, but what attracts the eye to these particular webs is that bizarre zig-zag decoration, known as a stabilimentum. What’s it for? Does it attract prey? Or mates? Does it camouflage the spider? Does it make the spider look larger than it is, to discourage predators? Does it make the web more conspicuous, to ward off inadvertent damage from passing birds or mammals? Maybe it’s just a skein of spare silk? Or a sunscreen.

Someday we may know the answer, but the spider never will.

Life Curves

Sunday, August 24th, 2008

J. John Sepkoski, Jr., was a fossil-hunter who did most of his digging in the library, sifting through the literature of paleontology to build a detailed, quantitative timeline of life on earth. Focusing on marine animals, he recorded the earliest and the latest known appearances of thousands of ancient organisms. The final edition of his compendium, published in 2002 (three years after his death at age 50), lists dates for more than 36,000 genera.

A few years ago I had a chance to get closely acquainted with Sepkoski’s compendium, when I needed a machine-readable version of the timeline. The listings were published on CD-ROM (remember those?), but the files were merely unstructured plain text. I needed something I could compute with, and so I spent a week or two reformatting the records and importing them into a database. (Others have done the same thing. Shanan Peters of the University of Wisconsin–Madison maintains an online version.)

Here is the summary graph that was the goal of my data-conversion project; it shows the number of extant genera as a function of time, according to Sepkoski’s tally of comings and goings:

Spekoski.png

My brief hands-on experience with Sepkoski’s compilation gave me a sense of how much care went into its preparation. Getting any large data collection into a computer tends to be a fiddly process. Irregularities that a human reader would hardly notice are sand in the gears of automated text processing. Sepkoski’s data files caused less trouble than I expected. The problems I encountered were mainly trivial typographic anomalies—missing punctuation, erratic spacing—and even those were surprisingly rare. The only hints of potentially meaningful errors were a dozen pairs of duplicated entries, where the same genus appeared twice in the listings. It’s easy to see how that would happen in a project that went on for almost three decades; indeed, it’s amazing there weren’t more duplicates.

In any case, I came away from this project with great respect for Sepkoski’s accomplishment, but that doesn’t mean that the curve reproduced above represents the final word on the history of life. It’s not even clear that the main features of the curve and its overall shape give an accurate portrait of changes in global biodiversity.

In constructing any such historical time series, certain biases and distortions are hard to overcome. Of particular importance in this case, fossils from more recent intervals are more likely to survive and to be discovered than those from more ancient times. This “pull of the recent” effect raises questions about the steep upward trend that dominates the Sepkoski curve from the Cretaceous to the present. Has evolution really been going crazy with innovation throughout the past 150 million years, or is that hockey-stick curve an artifact of preservational and sampling bias?

A newly completed analysis of another big fossil database addresses this question (and others). The data source for the new analysis is the Paleobiology Database, a large collaborative project coordinated by John Alroy of the University of California–Santa Barbara. The Paleobiology Database might be called a metacompilation: It brings together statistical and descriptive information from thousands of more-specialized fossil collections (83,444 at the latest count). Initial work on the database began a decade ago (Sepkoski was an early contributor), but it has shown a recent growth spurt.

Of course the new database is vulnerable to the same kinds of systematic bias that Sepkoski had to confront. There’s no avoiding the fact that, on the whole, younger geological strata are more accessible and better studied, and younger fossils are better preserved. But by organizing the data differently and retaining more information about each taxonomic group, Alroy and his colleagues see an opportunity to correct or compensate for some of the biases. Of particular note, whereas Sepkoski recorded only the first and last known appearance of each genus, Alroy et al. attempt to keep track of every occurrence of an organism. This extra information allows sampling bias to be estimated and corrected.

Consider these hypothetical fossil records, where each dot represents a single occurrence of a fossil organism in one of nine labeled intervals:

Alroy.png

In both cases Sepkoski’s protocol would merely indicate that the taxonomic group originated in period 3 and became extinct in or after period 8. The new database records each time unit in which the fossil was found and, whenever possible, the number of occurrences per interval. This data might seem like superfluous detail. After all, if an organism was alive in periods 3 and 8, we can safely infer that it must have existed in periods 4, 5, 6 and 7 as well, whether or not fossil evidence has come to light. But it turns out that recording occurrences rather than just chronological ranges allows for some helpful statistical magic.

As I understand it, the scheme works something like this. Suppose we could gather together all the fossils ever collected by paleontologists, and sort them into bins according to age. Because of the various sampling and preservational biases, the bins for fairly recent periods (say 50 million years ago, in the Tertiary) would be much fuller than the bins for earlier times (say 400 million years ago, in the Devonian). Any bin with more specimens would be likely to exhibit more diversity as well, simply because rare organisms have a better chance of showing up at least once in a larger sample. But we can control for this bias through a simple subsampling procedure: Draw a fixed number of specimens from each bin, making each selection at random and with replacement. The counts of genera in the subsamples should reflect the true diversity of the biota in each bin.
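Here is how I imagine the idealized version of the procedure, as a minimal sketch. The function name, the number of repeated trials and the fixed random seed are my own choices for illustration; the published method involves many refinements that are not attempted here.

```python
import random

def subsampled_diversity(bin_specimens, quota, trials=100, seed=0):
    """Average number of distinct genera seen in `quota` random draws.

    bin_specimens: list of genus names, one entry per specimen in a time bin.
    Draws are made at random and with replacement.
    quota, trials and seed are illustrative parameters.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sample = rng.choices(bin_specimens, k=quota)  # sampling with replacement
        total += len(set(sample))                     # distinct genera in this draw
    return total / trials
```

Calling the function with the same quota for a crowded Tertiary bin and a sparse Devonian bin puts the two diversity counts on an equal footing, whatever the sizes of the bins.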

In practice it gets more complicated than that, because we can’t actually sample the entire fossil record at the level of individual specimens; the best we can do is to randomly choose collections of fossils or the publications that describe them. And the publications vary greatly in how much quantitative data they include; some are just lists of species observed.

After many adjustments, refinements and calibrations, Alroy and 34 co-authors have published a diversity curve based on the subsampling technique:

Alroy.png

(Graph courtesy of John Alroy.)

Their article (subscription required) appeared last month in Science, along with 67 pages of supplementary material.

The Sepkoski and the Alroy graphs are twins separated at birth—widely separated. The overall upward trend still exists in the newer graph, but it is much less dramatic, especially in the past 100 million years. Some of the famous mass-extinction events, such as those at the end of the Permian (P) and at the end of the Cretaceous (K), are visible in the new graph but are altered in character; instead of a sudden crash after a sustained build-up, we see something more like a return to normal after a brief, sharp spike in diversity. (Alroy elaborates on the dynamics of mass extinctions in a second recent article, this one in PNAS.)

Looking at the two curves, I arrive at this question: How is the interested but nonexpert reader to evaluate these contrasting views of our planetary past? I want to emphasize that the question animating me is not “Who is right?” but “How can we know who is right?” Is there some way that the ordinary, scientifically literate outsider can form a reasoned judgment about such competing claims to truth?

It was questions like these that got me in trouble the last time I wandered into this area. In 2005 Richard A. Muller of the Lawrence Berkeley National Laboratory and Robert A. Rohde, a graduate student at UC Berkeley, published a report in Nature claiming to detect periodic cycles of rising and falling diversity in the Sepkoski data. Applying Fourier analysis to the time series, they reported finding a strong signal at a period of 62 million years and a weaker one at 140 million years. The claim was controversial from the start, and I decided to take a do-it-yourself approach to understanding the issue. I went back to the original data, reimplemented the analytic methods and tried to assess the robustness of the conclusion. I told the story in an American Scientist column.

The column pleased no one. It certainly didn’t please Muller and Rohde, who objected that I was out of my depth in my amateur attempt to replicate their work. It didn’t please the critics of the Muller-Rohde hypothesis, who thought my focus on certain narrow technical issues deflected attention from deeper conceptual flaws in the argument. And it didn’t please me, because I agreed with the criticisms from both sides.

I should also mention that my column had zero impact on the controversy, which not only continues to rage but has also been extended to the new database. Alroy writes in the PNAS article that some of the peaks and valleys forming the supposed cycles fail to materialize in the new data set. On the other hand, a preprint from Adrian L. Melott of the University of Kansas argues that cycles with periods of 62 and 150 million years emerge from the Paleobiology Database with higher statistical significance than they had in the Sepkoski collection.

All in all, I think I’ll sit this one out. I’ve been itching to get my hands on some records from the new database and implement the subsampling algorithm (which sounds both intriguing and readily accessible). It would be fun to play with these ideas. But I’ll let someone else have the fun this time.

Science builds its credibility on the bedrock idea that experiments and other kinds of results are subject to independent confirmation or refutation. And the advent of computational science has made this egalitarian ideal much more practical than it used to be. Although experiments in high-energy physics remain beyond the means of most amateurs, anything done with a computer rather than a particle accelerator is pretty much fair game these days. Still, there are bounds. If every reader set out to replicate every experiment, the world wouldn’t make much progress.

Last name first

Tuesday, November 20th, 2007

Saturday’s New York Times had a story by Sam Roberts about a newly released Census Bureau study of the frequency of surnames in the U.S. The Times story was mainly about the names at the top of the list, and especially the increasing prominence of Hispanic names (Garcia and Rodriguez have made it into the top ten). But what caught my attention was a passing comment about the bottom of the frequency distribution:

Altogether, the census found six million surnames in the United States. Among those, 151,000 were shared by a hundred or more Americans. Four million were held by only one person.

I was not surprised to learn that the distribution of name frequencies is steeply skewed, with a few common names and a great many rare ones. But could it be true that two-thirds of the names occur just once in the population—that four million people in the U.S. have a unique family name they share with no one else?

Looking through the lens of personal experience, I found it hard to believe those numbers. Over the years I’ve met some people whose family names are surely rare, but I am not aware of a single acquaintance who is the holder of a unique name—if only because everyone I know shares a name with parents or children or siblings or a spouse. After all, family names tend to run in families! To have a unique name, you’ve got to be the first of your line or the last of your line or both.

The study of name distributions has a long history. In the 1870s Francis Galton and Henry William Watson looked into the longevity of family names, concluding:

All the surnames, therefore, tend to extinction in an indefinite time, and this result might have been anticipated generally, for a surname once lost can never be recovered, and there is an additional chance of loss in every successive generation.

The argument sounds good, but it’s not quite as broadly applicable as Galton and Watson thought it was. Extinction is inevitable only in a static or shrinking population. If the population is growing, names and families can become all but immortal. In the 1920s Alfred Lotka calculated that American family names had about an 18 percent chance of surviving indefinitely. More recently, Susanna C. Manrubia, Bernard Derrida and Damián H. Zanette have developed a more refined computer model of name evolution (see arXiv preprint 1 and 2; there’s also a splendid American Scientist article, but annoyingly it’s only accessible to subscribers). Manrubia, Derrida and Zanette describe an equilibrium state where the distribution of names follows a power law. If we define a “clan” as the set of all people who have a surname in common (whether or not they are actually related), then the predicted number of clans of size m is proportional to \(m^{-\beta}\). Manrubia, Derrida and Zanette argue that β = 2. Thus, for example, clans 10 times larger should be 100 times rarer.
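The last step is a one-line calculation. Writing \(n(m)\) for the number of clans of size m (the symbol is my own shorthand), the ratio of counts for clans of size 10m and size m is

\[\frac{n(10m)}{n(m)} = \frac{(10m)^{-\beta}}{m^{-\beta}} = 10^{-\beta} = \frac{1}{100} \quad \text{when } \beta = 2.\]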

How do the new Census Bureau findings stack up against these predictions? Here is the frequency table included in the summary report (.pdf):

Table of frequencies of last names

For this data set the cumulative numbers are easier to work with because of the nonuniform bin sizes. Here’s how they look in a graph:

graph of cumulative name frequencies

Graphs of this kind can be confusing. I find it helpful to keep in mind that a point at coordinates x,y indicates there are y clans with x members or more.

If clan frequencies were governed by a strict power law, the graph would trace a straight line on these log-log scales. Overall, the curve is indeed fairly straight, tending to support the power-law model. But a few features of the curve seem to depart from the predictions. For one thing, the slope of the line gives an exponent closer to β = 1 than to the β = 2 that Manrubia, Derrida and Zanette would lead us to expect. I can’t explain that. A steepening of the curve at the large-clan end could be an artifact of finite sample size. Most interesting of all is the sudden uptick at the opposite end of the curve, where clans of size 1 are much more abundant than the power law predicts. On a logarithmic scale it’s easy to misjudge the magnitude of such a trivial-looking excursion: If the two leftmost data points (for clans of size 1 and size 2 through 4) were restored to the trend line of the data from clan sizes of 10 through 1,000, the total number of names in the survey would be about three million instead of six million, and there would be only one million unique names instead of four million.
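On second thought, one candidate explanation for the shallow slope deserves a look, though I offer it tentatively. The graph shows cumulative counts, and a cumulative count falls off more slowly than the distribution it is built from. Treating clan size as a continuous variable (an approximation), and writing \(n(m) = C m^{-\beta}\) for the number of clans of size m and \(N(x)\) for the number of clans with x members or more (the symbols are my own shorthand):

\[N(x) = \int_x^{\infty} C\, m^{-\beta}\, dm = \frac{C}{\beta - 1}\, x^{-(\beta - 1)}, \qquad \beta > 1.\]

If this reasoning applies to the census data, a distribution with β = 2 would trace a line of slope −1 on the cumulative log-log plot, which might account for at least part of the discrepancy.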

I’ll not keep you in suspense any longer about the cause of this anomaly. When I downloaded the Census Bureau report, I found that the authors (David L. Word, Charles D. Coleman, Robert Nunziata and Robert Kominski) are also skeptical about those four million solo monikers. They explain that the data came from census forms on which respondents were asked to print the first, middle and last names of all household residents; the forms were then electronically scanned, and the answers were extracted by optical character recognition. Errors at any point in the process could turn a common name into a unique (but fictitious) one—making a MLLLER out of a MILLER, say. Some of these errors were corrected in later processing, but others apparently slipped through. One particularly troublesome problem arose whenever a respondent printed an entire name in the space intended for the surname. The OCR software simply concatenated all the parts of such a response, leading to spurious surnames such as PETERJDAVIS. The report states that “many” of the four million unique names are products of such data-entry errors, but there is no attempt to quantify the effect.

For privacy reasons, the Census Bureau has released only the 151,671 names (.zip) occurring at least 100 times, so there’s no way to get a look at the unique names. You might think, though, that if three-fourths of them are malformed in some way, that fact would stand out prominently and would have been noticed even before this study was undertaken. You might even think that if 1 percent of respondents are entering names incorrectly, the Census Bureau would have discovered that fact in preliminary testing and would have redesigned the form before circulating it to 300 million people.

Still, I suppose the Bureau’s explanation must be true. There’s spotty suggestive evidence even in the list of names appearing 100 times or more. For example, the list includes surnames such as VANBURKLEO and JOHNSONWILLIAM. And either there are 160 people in the U.S. whose surname is JOHNOSN, or there are 160 JOHNSONs who all made the same transposition error when entering their name on a census form. (Or some combination of the above.)

Even if there are only a million unique names, that still seems like a lot—one out of every 300 people. Galton and Watson looked upon such lonely surnames as dying embers, the last hope of families on the brink of extinction. But some of the rare names are surely newborns rather than expiring elders. Immigration brings names that are new to the U.S. even if they are far from unique globally. And processes akin to mutation and recombination are creating new names all the time. In particular, recombination has become more important now that the purely patrilineal model of name transmission is no longer universal; surnames have broken free from their linkage to the Y chromosome. As a matter of fact, now that I think of it, I was wrong when I said that I have never known a person with a unique surname. I have friends who named their daughter Nina Auslander-Padgham, and her surname surely has a good chance at uniqueness. Or at least it did until Nina’s brother Milo was born.

Out of curiosity, I opened up the Boston-Cambridge phone book, selected a few pages at random, and counted up unique names as a proportion of all names. In a sample of 458 surnames, 254 were listed for one person only, or about 55 percent. This result isn’t too far from the two-thirds ratio in the Census Bureau report, but I’m not sure how to interpret it. The geographic area covered by the Boston directory includes a population of roughly a million, or about 1/300th of the national population. When you select a small sample of this kind—supposing it to be a random sample—what does the selection process do to the frequency distribution of names? If a name occurs 300 times nationally, it could well be unique in Boston, thereby apparently boosting the number of unique names. On the other hand, for every 300 names that truly are unique nationally, only one is likely to be represented in Boston, so in this way the number of unique names is greatly diminished. The question I leave you with is this: How best can we estimate the national (or global) proportion of unique names from a small random sample?
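
One way to start exploring that question is a toy simulation. The sketch below is only an experiment under stated assumptions, not an answer: it builds a synthetic population in which the number of clans of size k falls off as k^-β, draws a random subsample of individuals, and reports the share of names that appear exactly once. The function names and all parameter values (n_singletons, beta, max_size, rate, seed) are hypothetical.

```python
import random
from collections import Counter

def singleton_fraction(names):
    """Share of distinct names that occur exactly once in the list."""
    counts = Counter(names)
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts)

def synthetic_population(n_singletons=20_000, beta=2.0, max_size=1000):
    """One list entry per person; the number of clans of a given size
    is n_singletons / size**beta, rounded down."""
    people = []
    name_id = 0
    for size in range(1, max_size + 1):
        n_clans = int(n_singletons / size ** beta)
        for _ in range(n_clans):
            people.extend([name_id] * size)
            name_id += 1
    return people

def subsample_fraction(people, rate=1 / 300, seed=0):
    """Singleton share in a random sample of individuals."""
    rng = random.Random(seed)
    k = max(1, round(len(people) * rate))
    return singleton_fraction(rng.sample(people, k))

population = synthetic_population()
print("whole population:", singleton_fraction(population))
print("1/300 sample:    ", subsample_fraction(population))
```

Comparing the two printed values for different choices of beta and rate shows how the two opposing effects described above play out in this toy model. Inverting the relationship, to infer the population figure from the sample, is the harder problem.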

Boidland

Friday, November 2nd, 2007

starlings_2059.JPG

Above: A throbbing, wheeling mob of several thousand restless starlings, near a strip mall in Clayton, North Carolina, 27 October 2007. Below: Snow geese on maneuvers near Ashburn, Missouri, 12 November 2004.

geese_9869.JPG

In the 1930s, Edmund Selous argued that flocking behavior could be explained only through some form of animal ESP: “thought transference” was the only way that birds could communicate their intentions quickly enough to coordinate their movements. Others suggested there must be some designated leader among the birds, a conductor or drill sergeant whose cues the rest of the flock followed. By the 1980s a much simpler view emerged. Flocking birds (and also schooling fish and swarming insects) could maintain their formations without any need for leaders or a synchronizing central authority if each individual followed a few simple rules.

This idea of a leaderless flock with distributed intelligence was proposed by several authors at about the same time, but the version that made the biggest splash came from Craig J. Reynolds, whose famous “boids” made their premiere just 20 years ago, at the SIGGRAPH meeting in 1987. The simulation was described in a paper in the conference proceedings, and the source code is also available online. The program comes to about 3,000 lines of Lisp, written in the Zetalisp dialect of the Symbolics Lisp machine. These days, you can achieve the same effect with much less effort in multiagent programming environments such as StarLogo, NetLogo and breve.

All of the boids in a flock were guided by the same three rules:

  • Separation: Avoid collisions, and try to equalize distance to nearest neighbors.
  • Alignment: Turn to match the average heading of nearby boids.
  • Cohesion: Move toward the mean position (or center of mass) of the flock.
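
The three rules can be sketched in a few lines of Python. This is a minimal two-dimensional illustration, not Reynolds’s program: the Boid class, the step function, the neighborhood radius and the weights are all assumptions chosen for readability, and each rule is applied only to neighbors within the given radius.

```python
import math
from dataclasses import dataclass

@dataclass
class Boid:
    x: float
    y: float
    vx: float
    vy: float

def step(boids, radius=10.0, min_dist=2.0,
         w_sep=0.05, w_align=0.05, w_coh=0.01, dt=1.0):
    """Advance every boid by one time step and return the new list."""
    updated = []
    for b in boids:
        neighbors = [o for o in boids
                     if o is not b
                     and math.hypot(o.x - b.x, o.y - b.y) < radius]
        ax = ay = 0.0
        if neighbors:
            n = len(neighbors)
            # Cohesion: accelerate toward the mean position of the neighbors.
            ax += w_coh * (sum(o.x for o in neighbors) / n - b.x)
            ay += w_coh * (sum(o.y for o in neighbors) / n - b.y)
            # Alignment: nudge velocity toward the neighbors' mean velocity.
            ax += w_align * (sum(o.vx for o in neighbors) / n - b.vx)
            ay += w_align * (sum(o.vy for o in neighbors) / n - b.vy)
            # Separation: push away from any neighbor that is too close.
            for o in neighbors:
                if math.hypot(o.x - b.x, o.y - b.y) < min_dist:
                    ax += w_sep * (b.x - o.x)
                    ay += w_sep * (b.y - o.y)
        vx = b.vx + ax * dt
        vy = b.vy + ay * dt
        updated.append(Boid(b.x + vx * dt, b.y + vy * dt, vx, vy))
    return updated

flock = [Boid(0, 0, 1, 0), Boid(5, 0, 0, 1), Boid(1, 1, 0, 0)]
for _ in range(10):
    flock = step(flock)
```

No boid in this sketch is singled out as a leader; every one runs the same update using only what it can see nearby, which is the point of the model.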

When I first saw boids in action, I found them marvelously lifelike. Now, after spending a fair amount of time playing with models of this kind, I find the concept so familiar and so thoroughly internalized that I have a hard time seeing flocks in any other context. As I stand at the edge of an autumn field with clouds of starlings swirling around me, I think: How boidlike they are! Just as earlier observers were sure they saw leaders in the flock, or signs of spooky avian brainwaves, I can’t help seeing Separation, Alignment, Cohesion.

This weekend I am off to Florida to observe the self-organizing rituals of another flock. I’m attending the annual meeting of Sigma Xi, the society that publishes American Scientist. I’m at the meeting mainly to cheer friends who will be inducted as honorary members of the society. But it’s also interesting to observe how any such flock orients itself and chooses a direction for the future.

The family tree

Monday, October 22nd, 2007

When is a tree (large, woody plant) not a tree (connected acyclic graph)?

crepe myrtle with anastomosing stems

This has something or other to do with the topic of the previous post.

(The tree (?) is a crepe myrtle near the campus of North Carolina State University in Raleigh.)

How many of your ancestors are you related to?

Sunday, October 21st, 2007

David Aldous asked me that question over lunch one day. I didn’t have an answer, so he explained: In the simplest model of human genetics, you get half your genes from each parent, a fourth from each grandparent, and so on. Thus the fraction of your genes contributed by each member of the nth generation is 1/2^n. But there must be some value of n for which 2^n exceeds the total number of genes in the human genome. Suppose you have 50,000 genes. Well, 16 generations ago you had 2^16 = 65,536 ancestors, so roughly 15,000 of those family members were left out of the lottery. They’re your ancestors, but you inherited no genes from them.

There’s also a value of n for which 2^n is greater than the entire human population, so if you look back far enough, you have more ancestors than there were people on the planet. This has got to be a sign of something awry in the model; these calculations are not to be taken as a quantitatively accurate guide to the human family tree. Nevertheless, the idea of counting genes and counting ancestors is basically sound.
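
Both crossover points are easy to check. The short Python sketch below finds the first generation n at which 2^n exceeds a given count. The gene figure is the round number used above; the population figure is an assumption (a rough present-day world population), and the function name is invented for illustration.

```python
def first_generation_exceeding(limit):
    """Smallest n for which 2**n is greater than limit."""
    n = 0
    while 2 ** n <= limit:
        n += 1
    return n

GENES = 50_000               # round figure from the text
POPULATION = 6_600_000_000   # assumption: rough world population, circa 2007

n_genes = first_generation_exceeding(GENES)
print(n_genes, 2 ** n_genes, 2 ** n_genes - GENES)  # 16 65536 15536

n_pop = first_generation_exceeding(POPULATION)
print(n_pop)  # 33
```

Since the population was much smaller in earlier centuries, the real crossover would come sooner than this present-day figure suggests.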

These issues have come to the fore lately with news coverage of the discovery that Barack Obama is a distant cousin of Dick Cheney. According to Lynne Cheney, both are descended from Mareen Duvall, a 17th-century Huguenot. In today’s New York Times, Nicholas Wade comments on the significance (or otherwise) of this genealogical connection:

Mr. Obama probably inherited a minute fraction — one divided by two to the 11th power — of Mareen Duvall’s genome, which would amount to less than one gene, assuming the Y chromosome was not inherited.

Alas, though the concept is right, the numbers don’t quite add up. Two to the 11th power is only 2,048, and we surely have more genes than that. Under simple assumptions of random assortment, the expected number of genes passed down to the eleventh generation would be ten or so.
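
The arithmetic can be spelled out in a few lines of Python. The halving-per-generation model is the simple one described above; the total gene counts tried here (20,000 and 25,000) are assumptions standing in for “more genes than that,” and the function name is hypothetical.

```python
GENERATIONS = 11  # from the quoted passage

def expected_genes(total_genes, generations=GENERATIONS):
    """Expected number of genes inherited from one ancestor,
    halving the share at each generation."""
    return total_genes / 2 ** generations

print(2 ** GENERATIONS)  # 2048
for total in (20_000, 25_000):  # assumed range for the human gene count
    print(total, round(expected_genes(total), 1))  # 9.8, then 12.2
```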

Wade correctly notes that the candidate and the vice president are very unlikely to have inherited any of the same genes from their common ancestor. Not that I would change my vote just because they had a few snippets of DNA in common.

The green fuse

Friday, January 12th, 2007

The spirals and whorls seen in sunflowers, pine cones and various other plant structures have long held a special fascination for mathematicians and for biologists with a mathematical bent. After all, you can find Fibonacci numbers in those natural patterns—who could resist? But it’s not just Golden Ratio mysticism that accounts for this interest. More important, I think, is the mere fact that these patterns are simple and orderly enough that we have some hope of understanding them at a deep level. If you want to build a general theory of plant form and growth, then the process that yields the spiral patterns—called phyllotaxis—is a good place to start.

The Joint Mathematics Meetings in New Orleans had an especially good session on phyllotaxis, organized by SIAM, the Society for Industrial and Applied Mathematics. I learned a lot.

Plant stems and roots grow mainly from the tip, from a region of rapid cell division called the apical meristem. It’s not hard to see how this causes elongation of a shoot, but what about branching? Buds and florets and various other structures do not just arise at random as a plant grows; they are spaced at regular intervals, often in a helical pattern. For example, if you number the branches from top to bottom along a plant stem, then as you visit the branches in numerical order, you’ll find you are also going around the stem repeatedly. The divergence angle between successive branches is a crucial factor in determining the overall geometry of the plant. If the angle is 90 degrees, say, then the plant will have fourfold symmetry, and branch n will always lie directly above branch n+4. Supposedly, the divergence angle is often near 137.5 degrees, which is the “golden angle,” dividing a circle into two pie slices whose central angles are in Fibonacci ratio (the limit of the series 1/1, 1/2, 2/3, 3/5, 5/8…).
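
To see where 137.5 degrees comes from, here is a small Python sketch that splits a circle into two slices in the ratio of consecutive Fibonacci numbers and watches the smaller slice converge. The function and variable names are invented for illustration.

```python
import math

def smaller_angle(a, b):
    """Smaller slice, in degrees, when a circle is split in the ratio a : b."""
    return 360.0 * a / (a + b)

a, b = 1, 1
angles = []
for _ in range(20):
    angles.append(smaller_angle(a, b))
    a, b = b, a + b  # move to the next pair of consecutive Fibonacci numbers

print(angles[:4])  # [180.0, 120.0, 144.0, 135.0]

phi = (1 + math.sqrt(5)) / 2     # the golden ratio
golden_angle = 360.0 / phi ** 2  # limiting value, about 137.5 degrees
print(angles[-1], golden_angle)
```

The successive angles overshoot and undershoot the limit in turn, closing in on it quickly.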

How do plants accomplish this trick? A basic idea formulated in the 19th century (and perhaps glimpsed even earlier) is that any emerging branch inhibits the growth of other branches nearby; thus a new branch can begin developing only after expansion or elongation has created enough space to make room for it. Alan Turing, more than 50 years ago, suggested a chemical mechanism that might account for this effect.

This approach to understanding plant growth is by now textbook material, but there were some ideas presented in the six talks of the New Orleans session that came as news to me.

For starters, Turing got it backwards (which is not at all the same as getting it wrong). Turing’s model of biological development supposed that a few isolated hotspots produce a chemical growth factor, which then diffuses throughout the tissue; the gradient of concentration controls growth, with new buds appearing only where the concentration exceeds some threshold. The evidence now suggests that plant growth factors (called auxins) are produced by all cells at roughly the same level, and the concentration gradients arise not from passive diffusion but from active transport. Cells pump the auxins “uphill,” toward regions where their concentration is already elevated. Thus there is positive feedback: Abundant auxin attracts still more of it. If this mechanism operated without opposition, all the auxin would eventually accumulate in one place; the counterbalance is the continual creation of new cells, which has the effect of diluting concentration. In New Orleans Eric Mjolsness of the University of California, Irvine presented these results, which have just been published in the Proceedings of the National Academy of Sciences. In a follow-on talk Przemyslaw Prusinkiewicz of the University of Calgary presented an algorithmic model based on the experimental results; this work too has recently appeared in PNAS.

Whereas the auxin-pump mechanism fills in some intricate biomolecular details, another line of work highlights a model of phyllotaxis that is simpler than most others. Pau Atela and Christophe Golé of Smith College illustrated the idea with a penny game. Start with some pennies at the bottom of a sheet of paper, arrayed randomly but neither overlapping nor separated by more than one diameter. Now add pennies one at a time, always at the lowest available position on the paper. Each newly placed penny will be tangent to two others already present. (Tangency to three or four neighbors is possible but vanishingly rare.) A few stages of the process are illustrated below, where each newly added penny is shown in red.

penny-game phyllotaxis

If we interpret the rectangular sheet of paper as an unrolled cylinder, then the patterns produced in this way mimic phyllotaxis. The vertical position of a penny represents the height of a branching or budding point along a cylindrical stem; horizontal position corresponds to angle around the stem. (Note that on the unrolled cylinder left and right edges are identified, so that a penny going off the right side of the sheet comes back at the same height on the left.) Atela and Golé show by analysis and by numerical simulation that periodic branching patterns generally emerge even from random starting positions. If I understand correctly, they find that the famous golden angle is not very common when they measure the angle between individual successive branch points; on the other hand, the average angle does seem to converge on a value in the neighborhood of 137 degrees.

Although the penny model of phyllotaxis was new to me, it is not really new at all. The model originated with work by Mary and Robert Snow in the 1930s and has been studied by several others since then, including Stéphane Douady, who also spoke at the New Orleans session. For more details see the excellent phyllotaxis web site assembled by Atela and Golé and their students in conjunction with an exhibition at Smith in 2002–2003.

About the pretentiously literary title of this post: I know it’s right on the tip of your tongue…. Yes, that’s right, it’s Dylan Thomas:

The force that through the green fuse drives the flower
Drives my green age; that blasts the roots of trees
Is my destroyer.
And I am dumb to tell the crooked rose
My youth is bent by the same wintry fever.