Saturday’s New York Times had a story by Sam Roberts about a newly released Census Bureau study of the frequency of surnames in the U.S. The Times story was mainly about the names at the top of the list, and especially the increasing prominence of Hispanic names (Garcia and Rodriguez have made it into the top ten). But what caught my attention was a passing comment about the bottom of the frequency distribution:
Altogether, the census found six million surnames in the United States. Among those, 151,000 were shared by a hundred or more Americans. Four million were held by only one person.
I was not surprised to learn that the distribution of name frequencies is steeply skewed, with a few common names and a great many rare ones. But could it be true that two-thirds of the names occur just once in the population—that four million people in the U.S. have a unique family name they share with no one else?
Looking through the lens of personal experience, I found it hard to believe those numbers. Over the years I’ve met some people whose family names are surely rare, but I am not aware of a single acquaintance who is the holder of a unique name—if only because everyone I know shares a name with parents or children or siblings or a spouse. After all, family names tend to run in families! To have a unique name, you’ve got to be the first of your line or the last of your line or both.
The study of name distributions has a long history. In the 1870s Francis Galton and Henry William Watson looked into the longevity of family names, concluding:
All the surnames, therefore, tend to extinction in an indefinite time, and this result might have been anticipated generally, for a surname once lost can never be recovered, and there is an additional chance of loss in every successive generation.
The argument sounds good, but it’s not quite as broadly applicable as Galton and Watson thought it was. Extinction is inevitable only in a static or shrinking population. If the population is growing, names and families can become all but immortal. In the 1920s Alfred Lotka calculated that American family names had about an 18 percent chance of surviving indefinitely. More recently, Susanna C. Manrubia, Bernard Derrida and Damián H. Zanette have developed a more refined computer model of name evolution (see arXiv preprint 1 and 2; there’s also a splendid American Scientist article, but annoyingly it’s only accessible to subscribers). Manrubia, Derrida and Zanette describe an equilibrium state where the distribution of names follows a power law. If we define a “clan” as the set of all people who have a surname in common (whether or not they are actually related), then the predicted number of clans of size m is proportional to m^(-β). Manrubia, Derrida and Zanette argue that β = 2. Thus, for example, clans 10 times larger should be 100 times rarer.
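To make the "10 times larger, 100 times rarer" arithmetic concrete, here is a minimal Python sketch. The function names are my own, chosen for illustration, and the only thing assumed is the proportionality to m^(-β) stated above; the constant of proportionality cancels out of the ratio.

```python
def relative_clan_count(m, beta=2.0):
    """Relative number of clans of size m, assuming n(m) is proportional to m**(-beta)."""
    return m ** (-beta)


def rarity_ratio(size_factor, beta=2.0):
    """How many times rarer clans become when the clan size is multiplied by size_factor."""
    return relative_clan_count(1.0, beta) / relative_clan_count(size_factor, beta)


# With beta = 2, clans 10 times larger should be about 100 times rarer.
print(rarity_ratio(10))
# With beta = 1, the same step in size costs only a factor of about 10.
print(rarity_ratio(10, beta=1.0))
```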
How do the new Census Bureau findings stack up against these predictions? Here is the frequency table included in the summary report (.pdf):
For this data set the cumulative numbers are easier to work with because of the nonuniform bin sizes. Here’s how they look in a graph:
Graphs of this kind can be confusing. I find it helpful to keep in mind that a point at coordinates x,y indicates there are y clans with x members or more.
If clan frequencies were governed by a strict power law, the graph would trace a straight line on these log-log scales. Overall, the curve is indeed fairly straight, tending to support the power-law model. But a few features of the curve seem to depart from the predictions. For one thing, the slope of the line gives an exponent closer to β = 1 than β = 2, as Manrubia, Derrida and Zanette would lead us to expect. I can’t explain that. A steepening of the curve at the large-clan end could be an artifact of finite sample size. Most interesting of all is the sudden uptick at the opposite end of the curve, where clans of size 1 are much more abundant than the power law predicts. On a logarithmic scale it’s easy to misjudge the magnitude of such a trivial-looking excursion: If the two leftmost data points (for clans of size 1 and size 2 through 4) were restored to the trend line of the data from clan sizes of 10 through 1,000, the total number of names in the survey would be about three million instead of six million, and there would be only one million unique names instead of four million.
I’ll not keep you in suspense any longer about the cause of this anomaly. When I downloaded the Census Bureau report, I found that the authors (David L. Word, Charles D. Coleman, Robert Nunziata and Robert Kominski) are also skeptical about those four million solo monikers. They explain that the data came from census forms on which respondents were asked to print the first, middle and last names of all household residents; the forms were then electronically scanned, and the answers were extracted by optical character recognition. Errors at any point in the process could turn a common name into a unique (but fictitious) one—making a MLLLER out of a MILLER, say. Some of these errors were corrected in later processing, but others apparently slipped through. One particularly troublesome problem arose whenever a respondent printed an entire name in the space intended for the surname. The OCR software simply concatenated all the parts of such a response, leading to spurious surnames such as PETERJDAVIS. The report states that “many” of the four million unique names are products of such data-entry errors, but there is no attempt to quantify the effect.
For privacy reasons, the Census Bureau has released only the 151,671 names (.zip) occurring at least 100 times, so there’s no way to get a look at the unique names. You might think, though, that if three-fourths of them are malformed in some way, that fact would stand out prominently and would have been noticed even before this study was undertaken. You might even think that if 1 percent of respondents are entering names incorrectly, the Census Bureau would have discovered that fact in preliminary testing and would have redesigned the form before circulating it to 300 million people.
Still, I suppose the Bureau’s explanation must be true. There’s spotty suggestive evidence even in the list of names appearing 100 times or more. For example, the list includes surnames such as VANBURKLEO and JOHNSONWILLIAM. And either there are 160 people in the U.S. whose surname is JOHNOSN, or there are 160 JOHNSONs who all made the same transposition error when entering their name on a census form. (Or some combination of the above.)
Even if there are only a million unique names, that still seems like a lot—one out of every 300 people. Galton and Watson looked upon such lonely surnames as dying embers, the last hope of families on the brink of extinction. But some of the rare names are surely newborns rather than expiring elders. Immigration brings names that are new to the U.S. even if they are far from unique globally. And processes akin to mutation and recombination are creating new names all the time. In particular, recombination has become more important now that the purely patrilineal model of name transmission is no longer universal; surnames have broken free from their linkage to the Y chromosome. As a matter of fact, now that I think of it, I was wrong when I said that I have never known a person with a unique surname. I have friends who named their daughter Nina Auslander-Padgham, and her surname surely has a good chance at uniqueness. Or at least it did until Nina’s brother Milo was born.
Out of curiosity, I opened up the Boston-Cambridge phone book, selected a few pages at random, and counted up unique names as a proportion of all names. In a sample of 458 surnames, 254 were listed for one person only, or about 55 percent. This result isn’t too far from the two-thirds ratio in the Census Bureau report, but I’m not sure how to interpret it. The geographic area covered by the Boston directory includes a population of roughly a million, or about 1/300th of the national population. When you select a small sample of this kind—supposing it to be a random sample—what does the selection process do to the frequency distribution of names? If a name occurs 300 times nationally, it could well be unique in Boston, thereby apparently boosting the number of unique names. On the other hand, for every 300 names that truly are unique nationally, only one is likely to be represented in Boston, so in this way the number of unique names is greatly diminished. The question I leave you with is this: How best can we estimate the national (or global) proportion of unique names from a small random sample?
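One way to get a feel for what subsampling does to the share of unique names is to try it on a made-up population. The Python sketch below is only that: it builds a synthetic population in which the number of clans of size m is proportional to m^(-2), draws a uniform random 1/300 sample of individuals, and compares the fraction of names that appear exactly once before and after sampling. The population size, the exponent and the function names are all assumptions of mine, not anything taken from the Census data, and real names are certainly not scattered uniformly across the country.

```python
import random
from collections import Counter


def synthetic_population(n_singletons=200_000, beta=2.0, max_size=5_000):
    """Return one surname label per person.

    The number of clans of size m is round(n_singletons * m**(-beta)),
    so the counts follow an (assumed) power law and fall to zero for large m.
    """
    people = []
    name_id = 0
    for m in range(1, max_size + 1):
        n_clans = round(n_singletons * m ** (-beta))
        for _ in range(n_clans):
            people.extend([name_id] * m)
            name_id += 1
    return people


def unique_fraction(people):
    """Fraction of distinct names that are held by exactly one person in the list."""
    counts = Counter(people)
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts)


rng = random.Random(0)  # fixed seed so the run is repeatable
population = synthetic_population()
sample = rng.sample(population, len(population) // 300)

print("unique fraction, whole population:", unique_fraction(population))
print("unique fraction, 1/300 sample:    ", unique_fraction(sample))
```

Changing `beta` or the sampling fraction shows how strongly the sample's apparent proportion of unique names depends on the shape of the underlying distribution, which is exactly why the estimate is hard to invert.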
This question you ask appears quite frequently. For example: How best can we estimate the total number of species on Earth (or in a rainforest) from a small geographical sample? (Number of surnames -> Number of species) I would hope the biologists would know something about this.
It was studied from a computer science perspective in the recent paper:
Sofya Raskhodnikova, Dana Ron, Amir Shpilka and Adam Smith.
Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem (FOCS 2007).
I haven’t read this paper, though, so don’t know how practically relevant it really is.
I think it is clear that your assumptions will make a big difference. For example, consider two distributions of American surnames:
a. 300 million different surnames (all unique),
b. 150 million different surnames, each appearing twice (none unique).
Clearly, you can’t tell these two distributions apart unless your random sample is large enough to see a collision (which takes a sample of size sqrt(300 million) ≈ 17,000).
“Clearly, you can’t tell these two distributions apart unless your random sample is large enough to see a collision (which takes a sample of size sqrt(300 million) ≈ 17,000).”
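The square-root figure is the usual birthday-problem estimate. As a rough check, the Python sketch below computes the expected number of same-name pairs in a uniform random sample drawn without replacement from distribution (b), where every name is held by exactly two people. The function name and the uniform-sampling assumption are mine; at a sample size near sqrt(300 million), the expectation comes out at about half a collision, so that is indeed the scale at which (a) and (b) start to become distinguishable.

```python
import math


def expected_collisions(sample_size, population, copies=2):
    """Expected number of same-name pairs in a uniform sample without replacement,
    assuming every name is held by exactly `copies` people."""
    pairs = sample_size * (sample_size - 1) / 2
    # Given one person, (copies - 1) of the other (population - 1) people share the name.
    p_same_name = (copies - 1) / (population - 1)
    return pairs * p_same_name


N = 300_000_000
k = round(math.sqrt(N))

print(k)
print(expected_collisions(k, N))
# Under distribution (a), where every name is unique, no collision is possible.
print(expected_collisions(k, N, copies=1))
```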
You’re assuming that samples within a geographic area are independent variables. That is demonstrably not the case here: it wouldn’t be strange, for example, for all five holders of a given surname in the U.S. to live in the same house as a family.
This question calls to mind the distinction that N. N. Taleb draws between the “Gaussian” and the scalable. Since the researchers have demonstrated a nice power-law distribution, we’re definitely in scalable territory. I don’t know what argument the researchers made for the exponent being 2. I think if you could get a consistent non-2 estimate from several different U.S. cities, you’d have just as strong an argument for that exponent.
@Anonymous: The paper by Raskhodnikova et al. (available here) is indeed interesting, but the focus is mainly on the computational complexity of the task, not the question of what algorithm would give the most accurate estimate. Also, just for the record, the question addressed in most of the literature is the total number of names (or species), not the number of uniquely represented names or species. Of course if you know the shape of the distribution, either of these quantities would determine the other.
@Jess: It’s surely true that the geographic distribution of names is not i.i.d., but if we can’t solve the problem in that simple case, we’re going to have trouble with more realistic and more complicated distributions.
The claim that name frequencies follow a power law with exponent β = 2 is in fact based on an empirical observation. (There may be some theoretical justification as well.) Manrubia et al. graph name data from earlier U.S. Census reports and from the Berlin phone book, and in both cases the slope (judged by eyeball) strongly suggests β = 2. Why the new Census data should give such a different result is perplexing. Of course it’s quite possible that I made some blunder in my own analysis.
“…a few features of the curve seem to depart from the predictions. For one thing, the slope of the line gives an exponent closer to β = 1 than β = 2, as Manrubia, Derrida and Zanette would lead us to expect. I can’t explain that.”
Is the discrepancy not explained by the fact that the MDZ exponent is for the number of clans with *exactly* m members, while your exponent is, as you point out, for the cumulative statistic of the number of clans with m members or more? Roughly speaking, your power law is the integral of 1/x^2 (the MDZ power law) from m to infinity, which is 1/m. Or am I failing, as usual, to understand some subtle, obvious point?
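The integral argument can be checked numerically. The Python sketch below is an illustration under stated assumptions, not a reanalysis of the Census table: it takes exact counts proportional to m^(-2) up to an arbitrary cutoff of one million, accumulates them into "m members or more" counts, and fits a least-squares line on log-log scales over clan sizes from 10 to 1,000. The helper `loglog_slope` and the choice of fitting points are my own.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


BETA = 2.0
M_MAX = 1_000_000  # assumed cutoff on clan size, for illustration only

# Relative number of clans with exactly m members, for m = 1 .. M_MAX.
density = [m ** -BETA for m in range(1, M_MAX + 1)]

# Relative number of clans with m members or more (sum from m up to the cutoff).
cumulative = [0.0] * M_MAX
running = 0.0
for i in range(M_MAX - 1, -1, -1):
    running += density[i]
    cumulative[i] = running

sizes = [10, 30, 100, 300, 1000]
density_slope = loglog_slope(sizes, [density[m - 1] for m in sizes])
cumulative_slope = loglog_slope(sizes, [cumulative[m - 1] for m in sizes])

print("slope of exact counts:     ", density_slope)     # close to -2
print("slope of cumulative counts:", cumulative_slope)  # close to -1
```

For an ideal power law, then, the cumulative curve is shallower than the exact-count curve by one unit of slope, which is the effect described above.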
Barry must be right. (When it comes to calculus, he’s a guy who never makes Misteaks.)
Nevertheless, I don’t think he has resolved the mystery in this case. Here are the raw and cumulative name frequencies, reorganized into bins of uniform size:
If you take logs and fit a linear function, you’ll find there’s only a tiny difference in the slope—and the change is in the wrong direction. For the cumulative numbers, the slope is 0.942; for the raw numbers 0.929. (Given that both ends of this curve look a little fishy, it’s probably better to fit only to the middle five points. The slopes in that case are 0.854 and 0.833.)
Of course the possibility that I have made some misteak remains very much alive.
I study name distributions and have published in Nomina (UK), NAMES (USA) and Onomastica Canadiana (Canada). The US Census Bureau results are consistent with my experience. I discussed this issue with David Word years ago. I think that the problem is with our data capture philosophy and our principal data capture devices: the scanner and the keyboard.
What we do is try to recreate the name we have in front of us. Neither device is totally accurate, and the garbage ends up in the “small counts”. These devices let us “skip out of the real universe”, i.e., they let us create non-names, and we have no basis after the fact to reject many of these phantoms.
There is a solution: it is to load the known names (the finite universe) into the desktop, laptop, hand-held devices, et cetera, and not allow any name that is not known to the machine to enter any system until it has been reviewed. If it is really a new-to-the-system name, the system gets updated with the name.
We need to take name capture from a ‘recreate’ exercise, which sometimes, unfortunately, becomes a ‘create’ exercise, to a simple ‘look-up’.
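A minimal Python sketch of that look-up idea follows. It is purely illustrative: the function names, the in-memory set of known names and the review queue are all hypothetical, and the example names are the ones mentioned in the post above.

```python
def capture_surname(raw, known_names, review_queue):
    """Accept a surname only if it is already known; otherwise hold it for review.

    Returns the normalized name if accepted, or None if it was queued.
    """
    name = raw.strip().upper()
    if name in known_names:
        return name
    review_queue.append(name)
    return None


def approve(name, known_names, review_queue):
    """A reviewer confirms a genuinely new name, so the known universe is updated."""
    review_queue.remove(name)
    known_names.add(name)


known = {"MILLER", "JOHNSON", "DAVIS"}
queue = []

print(capture_surname(" miller ", known, queue))    # known name, accepted
print(capture_surname("MLLLER", known, queue))      # likely OCR error, held for review
print(capture_surname("PETERJDAVIS", known, queue)) # concatenated full name, held for review
print(queue)
```

The point of the design is that a phantom such as MLLLER never reaches the counts on its own; it can only enter through `approve`, after a person has looked at it.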
Ken