bit-player | An amateur's outlook on computation and mathematics

Saturday’s New York Times had a story by Sam Roberts about a newly released Census Bureau study of the frequency of surnames in the U.S. The Times story was mainly about the names at the top of the list, and especially the increasing prominence of Hispanic names (Garcia and Rodriguez have made it into the top ten). But what caught my attention was a passing comment about the bottom of the frequency distribution:

Altogether, the census found six million surnames in the United States. Among those, 151,000 were shared by a hundred or more Americans. Four million were held by only one person.

I was not surprised to learn that the distribution of name frequencies is steeply skewed, with a few common names and a great many rare ones. But could it be true that two-thirds of the names occur just once in the population—that four million people in the U.S. have a unique family name they share with no one else?

Looking through the lens of personal experience, I found it hard to believe those numbers. Over the years I’ve met some people whose family names are surely rare, but I am not aware of a single acquaintance who is the holder of a unique name—if only because everyone I know shares a name with parents or children or siblings or a spouse. After all, family names tend to run in families! To have a unique name, you’ve got to be the first of your line or the last of your line or both.

The study of name distributions has a long history. In the 1870s Francis Galton and Henry William Watson looked into the longevity of family names, concluding:

All the surnames, therefore, tend to extinction in an indefinite time, and this result might have been anticipated generally, for a surname once lost can never be recovered, and there is an additional chance of loss in every successive generation.

The argument sounds good, but it’s not quite as broadly applicable as Galton and Watson thought it was. Extinction is inevitable only in a static or shrinking population. If the population is growing, names and families can become all but immortal. In the 1920s Alfred Lotka calculated that American family names had about an 18 percent chance of surviving indefinitely. More recently, Susanna C. Manrubia, Bernard Derrida and Damián H. Zanette have developed a more refined computer model of name evolution (see arXiv preprints 1 and 2; there’s also a splendid American Scientist article, but annoyingly it’s only accessible to subscribers). Manrubia, Derrida and Zanette describe an equilibrium state where the distribution of names follows a power law. If we define a “clan” as the set of all people who have a surname in common (whether or not they are actually related), then the predicted number of clans of size m is proportional to m^(−β). Manrubia, Derrida and Zanette argue that β = 2. Thus, for example, clans 10 times larger should be 100 times rarer.
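The contrast between static and growing populations can be illustrated with the simplest branching-process model. The sketch below assumes that each bearer of a name leaves a Poisson-distributed number of name-carrying children. That distribution is chosen purely for convenience: it is not the one Lotka worked with, so the numbers will not reproduce his 18 percent figure. The function name `extinction_probability` and its parameters are illustrative labels, not anything from the cited work.

```python
import math

def extinction_probability(mean_offspring, tol=1e-12, max_iter=100_000):
    """Chance that a line founded by one person eventually dies out.

    Assumes a Poisson offspring distribution (an assumption of this sketch).
    The extinction probability is the smallest root of q = exp(mean * (q - 1)),
    which fixed-point iteration starting from q = 0 converges to.
    """
    q = 0.0
    for _ in range(max_iter):
        q_next = math.exp(mean_offspring * (q - 1.0))
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q  # not fully converged (happens when the mean is very close to 1)

# Compare a shrinking population with a growing one.
for mean in (0.9, 1.2):
    q = extinction_probability(mean)
    print(f"mean offspring {mean}: extinction {q:.4f}, survival {1 - q:.4f}")
```

With a mean of 1 or less the iteration creeps up to 1, which is Galton and Watson's certain extinction. With a mean above 1 it settles below 1, leaving a positive chance that the name survives indefinitely.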

How do the new Census Bureau findings stack up against these predictions? Here is the frequency table included in the summary report (.pdf):

[Table: frequencies of last names]

For this data set the cumulative numbers are easier to work with because of the nonuniform bin sizes. Here’s how they look in a graph:

[Figure: graph of cumulative name frequencies]

Graphs of this kind can be confusing. I find it helpful to keep in mind that a point at coordinates x,y indicates there are y clans with x members or more.

If clan frequencies were governed by a strict power law, the graph would trace a straight line on these log-log scales. Overall, the curve is indeed fairly straight, tending to support the power-law model. But a few features of the curve seem to depart from the predictions. For one thing, the slope of the line gives an exponent closer to β = 1 than to the β = 2 that Manrubia, Derrida and Zanette would lead us to expect. I can’t explain that. A steepening of the curve at the large-clan end could be an artifact of finite sample size. Most interesting of all is the sudden uptick at the opposite end of the curve, where clans of size 1 are much more abundant than the power law predicts. On a logarithmic scale it’s easy to misjudge the magnitude of such a trivial-looking excursion: If the two leftmost data points (for clans of size 1 and size 2 through 4) were restored to the trend line of the data from clan sizes of 10 through 1,000, the total number of names in the survey would be about three million instead of six million, and there would be only one million unique names instead of four million.
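One candidate explanation for the exponent mismatch, offered tentatively: the graph plots cumulative counts, not the number of clans of each exact size. If the number of clans of size m is proportional to m^(−β), then summing over all sizes of x or more gives a count roughly proportional to x^(−(β−1)). On that reading, a cumulative slope near −1 is what β = 2 would produce. The sketch below checks the arithmetic on purely synthetic counts (not Census data); the helper names and the cutoff `max_size` are assumptions of the sketch.

```python
import math

def cumulative_counts(beta, max_size):
    """cum[x] = sum of m**(-beta) for m = x .. max_size (index 0 is unused)."""
    cum = [0.0] * (max_size + 2)
    for m in range(max_size, 0, -1):
        cum[m] = cum[m + 1] + m ** (-beta)
    return cum

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx

beta = 2
sizes = [10, 20, 50, 100, 200, 500, 1000]  # the mid-range used for the trend line
cum = cumulative_counts(beta, max_size=200_000)

density_slope = loglog_slope(sizes, [m ** (-beta) for m in sizes])
cumulative_slope = loglog_slope(sizes, [cum[x] for x in sizes])
print(f"slope of per-size counts:  {density_slope:.3f}")
print(f"slope of cumulative counts: {cumulative_slope:.3f}")
```

The cutoff at `max_size` makes the cumulative curve bend slightly downward at the largest sizes, so the fitted slope comes out a little steeper than −1.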

I’ll not keep you in suspense any longer about the cause of this anomaly. When I downloaded the Census Bureau report, I found that the authors (David L. Word, Charles D. Coleman, Robert Nunziata and Robert Kominski) are also skeptical about those four million solo monikers. They explain that the data came from census forms on which respondents were asked to print the first, middle and last names of all household residents; the forms were then electronically scanned, and the answers were extracted by optical character recognition. Errors at any point in the process could turn a common name into a unique (but fictitious) one—making a MLLLER out of a MILLER, say. Some of these errors were corrected in later processing, but others apparently slipped through. One particularly troublesome problem arose whenever a respondent printed an entire name in the space intended for the surname. The OCR software simply concatenated all the parts of such a response, leading to spurious surnames such as PETERJDAVIS. The report states that “many” of the four million unique names are products of such data-entry errors, but there is no attempt to quantify the effect.

For privacy reasons, the Census Bureau has released only the 151,671 names (.zip) occurring at least 100 times, so there’s no way to get a look at the unique names. You might think, though, that if three-fourths of them are malformed in some way, that fact would stand out prominently and would have been noticed even before this study was undertaken. You might even think that if 1 percent of respondents are entering names incorrectly, the Census Bureau would have discovered that fact in preliminary testing and would have redesigned the form before circulating it to 300 million people.

Still, I suppose the Bureau’s explanation must be true. There’s spotty but suggestive evidence even in the list of names appearing 100 times or more. For example, the list includes surnames such as VANBURKLEO and JOHNSONWILLIAM. And either there are 160 people in the U.S. whose surname is JOHNOSN, or there are 160 JOHNSONs who all made the same transposition error when entering their name on a census form. (Or some combination of the above.)
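The kinds of errors described here (a swapped pair of letters, a misread character, a full name run together) are simple enough to test for mechanically. What follows is only a sketch of how one might flag suspicious entries against a list of known names. It is not the Census Bureau's procedure, the three helper functions are hypothetical, and the tiny `known` set is made up for the example.

```python
def is_adjacent_transposition(a, b):
    """True if b is a with exactly one pair of neighboring letters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (len(diff) == 2 and diff[1] == diff[0] + 1
            and a[diff[0]] == b[diff[1]] and a[diff[1]] == b[diff[0]])

def is_single_substitution(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def split_into_known_names(name, known):
    """Greedy split of name into known names, treating stray single letters
    as possible initials. Returns the parts only if at least two known
    multi-letter names were found; otherwise returns None."""
    parts, i = [], 0
    while i < len(name):
        # Longest known name starting at i, falling back to one letter.
        for end in range(len(name), i, -1):
            if name[i:end] in known or end == i + 1:
                parts.append(name[i:end])
                i = end
                break
    full_names = [p for p in parts if len(p) > 1]
    return parts if len(full_names) >= 2 else None

known = {"JOHNSON", "WILLIAM", "MILLER", "PETER", "DAVIS"}  # made-up sample
print(is_adjacent_transposition("JOHNOSN", "JOHNSON"))
print(is_single_substitution("MLLLER", "MILLER"))
print(split_into_known_names("PETERJDAVIS", known))
print(split_into_known_names("JOHNSONWILLIAM", known))
```

A greedy split like this will produce false alarms on legitimate compound surnames, so at best it nominates candidates for a closer look.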

Even if there are only a million unique names, that still seems like a lot—one out of every 300 people. Galton and Watson looked upon such lonely surnames as dying embers, the last hope of families on the brink of extinction. But some of the rare names are surely newborns rather than expiring elders. Immigration brings names that are new to the U.S. even if they are far from unique globally. And processes akin to mutation and recombination are creating new names all the time. In particular, recombination has become more important now that the purely patrilineal model of name transmission is no longer universal; surnames have broken free from their linkage to the Y chromosome. As a matter of fact, now that I think of it, I was wrong when I said that I have never known a person with a unique surname. I have friends who named their daughter Nina Auslander-Padgham, and her surname surely has a good chance at uniqueness. Or at least it did until Nina’s brother Milo was born.

Out of curiosity, I opened up the Boston-Cambridge phone book, selected a few pages at random, and counted up unique names as a proportion of all names. In a sample of 458 surnames, 254 were listed for one person only, or about 55 percent. This result isn’t too far from the two-thirds ratio in the Census Bureau report, but I’m not sure how to interpret it. The geographic area covered by the Boston directory includes a population of roughly a million, or about 1/300th of the national population. When you select a small sample of this kind—supposing it to be a random sample—what does the selection process do to the frequency distribution of names? If a name occurs 300 times nationally, it could well be unique in Boston, thereby apparently boosting the number of unique names. On the other hand, for every 300 names that truly are unique nationally, only one is likely to be represented in Boston, so in this way the number of unique names is greatly diminished. The question I leave you with is this: How best can we estimate the national (or global) proportion of unique names from a small random sample?
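One way to start exploring the question is to work out what sampling does to a known distribution. The sketch below assumes the sample is formed by including each person independently with probability p, which is an idealization of a phone book. Under that assumption a clan of size m shows up exactly once with probability m·p·(1−p)^(m−1), and shows up at all with probability 1−(1−p)^m. The synthetic clan counts follow the β = 2 power law discussed earlier; they are not Census figures, and the function name is an illustrative choice.

```python
def sampled_singleton_fraction(clan_counts, p):
    """Expected share of names in the sample that appear exactly once there.

    clan_counts maps clan size m -> number of clans of that size in the
    full population; p is the per-person inclusion probability.
    """
    miss = 1.0 - p
    once = 0.0  # expected number of names seen exactly once in the sample
    seen = 0.0  # expected number of names seen at least once in the sample
    for size, count in clan_counts.items():
        once += count * size * p * miss ** (size - 1)
        seen += count * (1.0 - miss ** size)
    return once / seen

# Synthetic population: number of clans of size m proportional to 1/m**2.
synthetic = {m: 1_000_000 / m**2 for m in range(1, 100_001)}
national_share = synthetic[1] / sum(synthetic.values())
sampled_share = sampled_singleton_fraction(synthetic, 1 / 300)
print(f"unique share in full population: {national_share:.3f}")
print(f"unique share in a 1/300 sample:  {sampled_share:.3f}")
```

This runs the calculation in the forward direction only. Estimating the national proportion from the sample means inverting it, which requires some assumption about the shape of the underlying distribution.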

Last name first