<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/1.5.2" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Last name first</title>
	<link>http://bit-player.org/2007/last-name-first</link>
	<description>An amateur's outlook on computation and mathematics.</description>
	<pubDate>Fri, 29 Aug 2008 05:11:43 +0000</pubDate>
	<generator>http://wordpress.org/?v=1.5.2</generator>

	<item>
 		<title>Comment on Last name first by: D K Tucker</title>
		<link>http://bit-player.org/2007/last-name-first#comment-1571</link>
		<pubDate>Sun, 06 Jan 2008 21:48:54 +0000</pubDate>
		<guid>http://bit-player.org/2007/last-name-first#comment-1571</guid>
					<description>I study name distributions and have published in Nomina (UK), NAMES (USA) and Onomastica Canadiana (Canada).  The US Census Bureau results are consistent with my experience.  I discussed this issue with David Word years ago. I think that the problem is with our data capture philosophy and our principal data capture devices: the scanner and the keyboard.

What we do is try to recreate the name we have in front of us.  Neither device is totally accurate and the garbage ends up in the &quot;small counts&quot;.  These devices let us &quot;skip out of the real universe&quot;, i.e., they let us create non-names, and we have no basis after the fact to reject many of these phantoms.

There is a solution: load the known names (the finite universe) into the desktop, laptop, handheld devices, et cetera, and do not allow any name that is not known to the machine to enter any system until it has been reviewed. If it is really a new-to-the-system name, the system gets updated with the name.

We need to take name capture from a 'recreate' exercise, which sometimes, unfortunately, becomes a 'create' exercise, to a simple 'look-up'.

Ken</description>
		<content:encoded><![CDATA[	<p>I study name distributions and have published in Nomina (UK), NAMES (USA) and Onomastica Canadiana (Canada).  The US Census Bureau results are consistent with my experience.  I discussed this issue with David Word years ago. I think that the problem is with our data capture philosophy and our principal data capture devices: the scanner and the keyboard.</p>
	<p>What we do is try to recreate the name we have in front of us.  Neither device is totally accurate and the garbage ends up in the &#8220;small counts&#8221;.  These devices let us &#8220;skip out of the real universe&#8221;, i.e., they let us create non-names, and we have no basis after the fact to reject many of these phantoms.</p>
	<p>There is a solution: load the known names (the finite universe) into the desktop, laptop, handheld devices, et cetera, and do not allow any name that is not known to the machine to enter any system until it has been reviewed. If it is really a new-to-the-system name, the system gets updated with the name.</p>
	<p>We need to take name capture from a &#8216;recreate&#8217; exercise, which sometimes, unfortunately, becomes a &#8216;create&#8217; exercise, to a simple &#8216;look-up&#8217;.</p>
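	<p>A minimal sketch of the look-up approach described above, not the commenter's own system. The names known_names, review_queue, capture_name and approve, the three sample surnames, and the upper-case normalization are all assumptions made for illustration.</p>

```python
# Hypothetical in-memory store; a real system would load the full list of known surnames.
known_names = {"TUCKER", "CIPRA", "SMITH"}
review_queue = []  # entries waiting for a human decision


def capture_name(entered):
    """Return the known name matching the entry, or None if it was held for review."""
    key = entered.strip().upper()
    if key in known_names:
        return key
    # Unknown spelling: park it for review instead of letting it into the system.
    if key not in review_queue:
        review_queue.append(key)
    return None


def approve(name):
    """A reviewer confirms a genuinely new name, so the known list is updated."""
    review_queue.remove(name)
    known_names.add(name)
```

	<p>In this sketch a mistyped entry such as "Smiht" is not stored as a new surname; capture_name returns None and the entry waits in review_queue until someone either corrects it or calls approve.</p>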
	<p>Ken
</p>
]]></content:encoded>
				</item>
	<item>
 		<title>Comment on Last name first by: brian</title>
		<link>http://bit-player.org/2007/last-name-first#comment-1542</link>
		<pubDate>Tue, 27 Nov 2007 16:21:19 +0000</pubDate>
		<guid>http://bit-player.org/2007/last-name-first#comment-1542</guid>
					<description>Barry must be right. (When it comes to calculus, he's a guy who never makes &lt;a href=&quot;http://www.akpeters.com/product.asp?ProdCode=1225&quot; rel=&quot;nofollow&quot;&gt;Misteaks&lt;/a&gt;.) 

Nevertheless, I don't think he has resolved the mystery in this case. Here are the raw and cumulative name frequencies, reorganized into bins of uniform size:

&lt;pre&gt;
freq                  raw                cum
1+                5493558            6248415
10+                603186             754857
100+               128015             151671
1000+               20369              23656
10000+               3012               3287
100000+               268                275
1000000+                7                  7
&lt;/pre&gt;

If you take logs and fit a linear function, you'll find there's only a tiny difference in the slope—and the change is in the wrong direction. For the cumulative numbers, the slope is 0.942; for the raw numbers 0.929. (Given that both ends of this curve look a little fishy, it's probably better to fit only to the middle five points. The slopes in that case are 0.854 and 0.833.)

Of course the possibility that &lt;em&gt;I&lt;/em&gt; have made some misteak remains very much alive.</description>
		<content:encoded><![CDATA[	<p>Barry must be right. (When it comes to calculus, he&#8217;s a guy who never makes <a href="http://www.akpeters.com/product.asp?ProdCode=1225" rel="nofollow">Misteaks</a>.) </p>
	<p>Nevertheless, I don&#8217;t think he has resolved the mystery in this case. Here are the raw and cumulative name frequencies, reorganized into bins of uniform size:</p>
	<pre>
freq                  raw                cum
1+                5493558            6248415
10+                603186             754857
100+               128015             151671
1000+               20369              23656
10000+               3012               3287
100000+               268                275
1000000+                7                  7
</pre>
	<p>If you take logs and fit a linear function, you&#8217;ll find there&#8217;s only a tiny difference in the slope—and the change is in the wrong direction. For the cumulative numbers, the slope is 0.942; for the raw numbers 0.929. (Given that both ends of this curve look a little fishy, it&#8217;s probably better to fit only to the middle five points. The slopes in that case are 0.854 and 0.833.)</p>
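	<p>The fit described above can be sketched as follows, assuming an ordinary least-squares line through the base-10 logs of each bin's lower bound and its count. The helper name loglog_slope is hypothetical, and the fitted slopes come out negative, so the figures quoted above would be their magnitudes.</p>

```python
import math

# Bin lower bounds and counts copied from the table above.
freq = [1, 10, 100, 1000, 10000, 100000, 1000000]
raw = [5493558, 603186, 128015, 20369, 3012, 268, 7]
cum = [6248415, 754857, 151671, 23656, 3287, 275, 7]


def loglog_slope(xs, ys):
    """Least-squares slope of log10(ys) against log10(xs)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Fits over all seven bins.
slope_raw = loglog_slope(freq, raw)
slope_cum = loglog_slope(freq, cum)

# Fits over the middle five bins only (10+ through 100000+).
slope_raw_mid = loglog_slope(freq[1:6], raw[1:6])
slope_cum_mid = loglog_slope(freq[1:6], cum[1:6])

print(slope_raw, slope_cum, slope_raw_mid, slope_cum_mid)
```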
	<p>Of course the possibility that <em>I</em> have made some misteak remains very much alive.
</p>
]]></content:encoded>
				</item>
	<item>
 		<title>Comment on Last name first by: Barry Cipra</title>
		<link>http://bit-player.org/2007/last-name-first#comment-1539</link>
		<pubDate>Thu, 22 Nov 2007 18:20:58 +0000</pubDate>
		<guid>http://bit-player.org/2007/last-name-first#comment-1539</guid>
					<description>&quot;...a few features of the curve seem to depart from the predictions. For one thing, the slope of the line gives an exponent closer to β = 1 than β = 2, as Manrubia, Derrida and Zanette would lead us to expect. I can’t explain that.&quot;

Is the discrepancy not explained by the fact that the MDZ exponent is for the number of clans with *exactly* m members, while your exponent is, as you point out, for the cumulative statistic of the number of clans with m members or more?  Roughly speaking, your power law is the integral of 1/x^2 (the MDZ power law) from m to infinity, which is 1/m.  Or am I failing, as usual, to understand some subtle, obvious point?</description>
		<content:encoded><![CDATA[	<p>&#8220;&#8230;a few features of the curve seem to depart from the predictions. For one thing, the slope of the line gives an exponent closer to β = 1 than β = 2, as Manrubia, Derrida and Zanette would lead us to expect. I can’t explain that.&#8221;</p>
	<p>Is the discrepancy not explained by the fact that the MDZ exponent is for the number of clans with *exactly* m members, while your exponent is, as you point out, for the cumulative statistic of the number of clans with m members or more?  Roughly speaking, your power law is the integral of 1/x^2 (the MDZ power law) from m to infinity, which is 1/m.  Or am I failing, as usual, to understand some subtle, obvious point?
</p>
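	<p>A rough numerical check of the argument above, assuming the number of clans with exactly k members is proportional to 1/k^2. The function name tail_sum and the cutoff of one million are assumptions; the cutoff simply stands in for infinity.</p>

```python
def tail_sum(m, upper=1_000_000):
    """Sum of 1/k**2 for k from m to upper, a truncated stand-in for the infinite tail."""
    return sum(1.0 / (k * k) for k in range(m, upper + 1))


# Compare the cumulative count (clans with m members or more) with 1/m.
for m in (10, 100, 1000):
    print(m, tail_sum(m), 1.0 / m)
```

	<p>The two columns should agree more closely as m grows, which is the sense in which an exponent of 2 for exact counts corresponds to an exponent of 1 for cumulative counts.</p>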
]]></content:encoded>
				</item>
	<item>
 		<title>Comment on Last name first by: brian</title>
		<link>http://bit-player.org/2007/last-name-first#comment-1538</link>
		<pubDate>Wed, 21 Nov 2007 22:29:33 +0000</pubDate>
		<guid>http://bit-player.org/2007/last-name-first#comment-1538</guid>
					<description>@Anonymous: The paper by Raskhodnikova et al. (available &lt;a href=&quot;http://www.cse.psu.edu/~sofya/&quot; rel=&quot;nofollow&quot;&gt;here&lt;/a&gt;) is indeed interesting, but the focus is mainly on the computational complexity of the task, not the question of what algorithm would give the most accurate estimate. Also, just for the record, the question addressed in most of the literature is the total number of names (or species), not the number of uniquely represented names or species. Of course if you know the shape of the distribution, either of these quantities would determine the other.

@Jess: It's surely true that the geographic distribution of names is not i.i.d., but if we can't solve the problem in that simple case, we're going to have trouble with more realistic and more complicated distributions.

The claim that name frequencies follow a power law with exponent &amp;#946; = 2 is in fact based on an empirical observation. (There may be some theoretical justification as well.) Manrubia et al. graph name data from earlier U.S. Census reports and from the Berlin phone book, and in both cases the slope (judged by eyeball) strongly suggests &amp;#946; = 2. Why the new Census data should give such a different result is perplexing. Of course it's quite possible that I made some blunder in my own analysis.</description>
		<content:encoded><![CDATA[	<p>@Anonymous: The paper by Raskhodnikova et al. (available <a href="http://www.cse.psu.edu/~sofya/" rel="nofollow">here</a>) is indeed interesting, but the focus is mainly on the computational complexity of the task, not the question of what algorithm would give the most accurate estimate. Also, just for the record, the question addressed in most of the literature is the total number of names (or species), not the number of uniquely represented names or species. Of course if you know the shape of the distribution, either of these quantities would determine the other.</p>
	<p>@Jess: It&#8217;s surely true that the geographic distribution of names is not i.i.d., but if we can&#8217;t solve the problem in that simple case, we&#8217;re going to have trouble with more realistic and more complicated distributions.</p>
	<p>The claim that name frequencies follow a power law with exponent &beta; = 2 is in fact based on an empirical observation. (There may be some theoretical justification as well.) Manrubia et al. graph name data from earlier U.S. Census reports and from the Berlin phone book, and in both cases the slope (judged by eyeball) strongly suggests &beta; = 2. Why the new Census data should give such a different result is perplexing. Of course it&#8217;s quite possible that I made some blunder in my own analysis.
</p>
]]></content:encoded>
				</item>
	<item>
 		<title>Comment on Last name first by: Jess</title>
		<link>http://bit-player.org/2007/last-name-first#comment-1537</link>
		<pubDate>Wed, 21 Nov 2007 19:50:16 +0000</pubDate>
		<guid>http://bit-player.org/2007/last-name-first#comment-1537</guid>
					<description>&quot;Clearly, you can’t tell these two distributions apart unless your random sample is large enough to see a collision (which takes a sample of size sqrt{300 million} ≈ 17,000).&quot;

You're assuming that samples within a geographic area are independent variables.  That is demonstrably not the case here: it wouldn't be strange, for example, for all five holders of a given surname in the U.S. to live in the same house as a family.

This question calls to mind the distinction that N. N. Taleb draws between the &quot;Gaussian&quot; and the scalable.  Since the researchers have demonstrated a nice power-law distribution, we're definitely in scalable territory.  I don't know what argument the researchers made for the exponent being 2.  I think if you could get a consistent non-2 estimate from several different U.S. cities, you'd have just as strong an argument for that exponent.</description>
		<content:encoded><![CDATA[	<p>&#8220;Clearly, you can’t tell these two distributions apart unless your random sample is large enough to see a collision (which takes a sample of size sqrt{300 million} ≈ 17,000).&#8221;</p>
	<p>You&#8217;re assuming that samples within a geographic area are independent variables.  That is demonstrably not the case here: it wouldn&#8217;t be strange, for example, for all five holders of a given surname in the U.S. to live in the same house as a family.</p>
	<p>This question calls to mind the distinction that N. N. Taleb draws between the &#8220;Gaussian&#8221; and the scalable.  Since the researchers have demonstrated a nice power-law distribution, we&#8217;re definitely in scalable territory.  I don&#8217;t know what argument the researchers made for the exponent being 2.  I think if you could get a consistent non-2 estimate from several different U.S. cities, you&#8217;d have just as strong an argument for that exponent.
</p>
]]></content:encoded>
				</item>
	<item>
 		<title>Comment on Last name first by: Anonymous</title>
		<link>http://bit-player.org/2007/last-name-first#comment-1536</link>
		<pubDate>Wed, 21 Nov 2007 07:49:10 +0000</pubDate>
		<guid>http://bit-player.org/2007/last-name-first#comment-1536</guid>
					<description>This question you ask appears quite frequently.  For example: How best can we estimate the total number of species on Earth (or in a rainforest) from a small geographical sample?  (Number of surnames -&amp;#62; Number of species)  I would hope the biologists would know something about this.  

It was studied from a computer science perspective in the recent paper: 
Sofya Raskhodnikova, Dana Ron, Amir Shpilka and Adam Smith.
Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem (FOCS 2007).  
I haven't read this paper, though, so don't know how practically relevant it really is.  

I think it is clear that your assumptions will make a big difference.  For example, consider two distributions of American surnames: 
a. 300 million different surnames (all unique), 
b. 150 million different surnames, each appearing twice (none unique).
Clearly, you can't tell these two distributions apart unless your random sample is large enough to see a collision (which takes a sample of size sqrt{300 million} ≈ 17,000).</description>
		<content:encoded><![CDATA[	<p>This question you ask appears quite frequently.  For example: How best can we estimate the total number of species on Earth (or in a rainforest) from a small geographical sample?  (Number of surnames -&gt; Number of species)  I would hope the biologists would know something about this.  </p>
	<p>It was studied from a computer science perspective in the recent paper:<br />
Sofya Raskhodnikova, Dana Ron, Amir Shpilka and Adam Smith.<br />
Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem (FOCS 2007).<br />
I haven&#8217;t read this paper, though, so don&#8217;t know how practically relevant it really is.  </p>
	<p>I think it is clear that your assumptions will make a big difference.  For example, consider two distributions of American surnames:<br />
a. 300 million different surnames (all unique),<br />
b. 150 million different surnames, each appearing twice (none unique).<br />
Clearly, you can&#8217;t tell these two distributions apart unless your random sample is large enough to see a collision (which takes a sample of size sqrt{300 million} ≈ 17,000).
</p>
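	<p>A small sketch of the collision estimate above, assuming a uniform random sample of people drawn from case b and a Poisson-style approximation for the chance of at least one repeated surname. The variable names are hypothetical.</p>

```python
import math

population = 300_000_000  # figure used in the comment above

# Sample size on the order of the square root of the population.
sample = math.ceil(math.sqrt(population))

# Case b: every surname is shared by exactly two people, so a given pair of
# sampled people matches with probability 1 / (population - 1).
pairs = sample * (sample - 1) / 2
p_collision = 1.0 - math.exp(-pairs / (population - 1))

print(sample, round(p_collision, 3))
```

	<p>Under these assumptions a sample of about sqrt{300 million} people sees a repeated surname only some of the time, so reliably telling case a from case b would take a sample a few times larger.</p>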
]]></content:encoded>
				</item>
</channel>
</rss>
