Archive for the ‘modern life’ Category

Spam stats

Thursday, June 5th, 2008

Hormel Foods, the Minnesota meatpacker, reports a surge in sales of Spam. News accounts attribute the rising popularity of the pink meat-in-a-can to higher prices for other commodities. Or maybe it’s the Spam musubi fad.

Meanwhile, the other kind of spam seems to be surging as well. I’ve been keeping track of my personal spam consumption for the past five years. (I first wrote about this in 2003, with a follow-up in 2007.) Here’s a record of the total number of messages landing in my spam bin each month since the start of 2007:

spamvolume.png

The lull last spring gave me some hope that spam was finally in decline; the monthly intake even fell below 1,000 messages in March and April. But the respite didn’t last. There was steady growth through last summer and fall, and now another spike in volume has brought the rate to nearly 3,000 messages per month.

The message counts charted above lump together spam sent to several email addresses. Here’s a breakdown by address, covering the entire 17-month period:

mailboxes.png

The two addresses that attract the most unwanted traffic—namely, my address here at bit-player.org and another at amsci.org—are both published openly on the web, without any form of obfuscation. So are the addresses identified in the pie chart as “il-perms” and “il-prints”; they appear on my industrial-landscape.org web site. I’m certainly not surprised that spammers have discovered these addresses; they are fair game to anyone who knows how to scrape a web site. But there are still some puzzles in the data. I have several more email addresses that are equally vulnerable—they are published in the same places—but they receive nary a spam. Why not? And my earthlink.net and acm.org addresses are not published (or even much used), yet they get a healthy share of junk mail.

The content of the spam remains much the same—replica watches, blue pills, pirate software, phishing expeditions. Numbingly repetitious. In one week I got 25 messages with the same subject line: “eBay New Unpaid Item Message from snorelax67.” Then there were the 34 messages with subject lines such as “Viadzgra - $1.20,” “Viabqgra - $1.75,” “Viafmgra - $1.09” and “Viategra - $1.38.” (Evidently someone has written a little program to insert random letter pairs in the middle of the word. My spam filter was not fooled. Nor did it fall for “Hihg - qualiyt repliacs of the ebst lcock of the wrold!!”) In “How Many Ways Can You Spell V1@gra?” I argued that most of the world’s spam is coming from a relatively small number of senders—tens or possibly hundreds, but not thousands—and I think the evidence continues to support that conjecture.
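The "little program" conjectured above would need only a few lines. What follows is a guess at how such subject lines might be produced, along with one simple pattern that sees through them. The function names, the price format and the regular expression are all assumptions made for illustration; they do not describe any real spammer's tool or any particular spam filter.

```python
import random
import re
import string

def obfuscate(word="Viagra", rng=random):
    """Insert two random lowercase letters in the middle of the word."""
    middle = len(word) // 2
    pair = "".join(rng.choice(string.ascii_lowercase) for _ in range(2))
    return word[:middle] + pair + word[middle:]

# Hypothetical filter rule: "via" and "gra" separated by at most three letters.
PATTERN = re.compile(r"\bvia[a-z]{0,3}gra\b", re.IGNORECASE)

def looks_like_viagra(subject):
    """True if the subject contains the product name or a padded variant."""
    return PATTERN.search(subject) is not None

if __name__ == "__main__":
    for _ in range(4):
        subject = f"{obfuscate()} - $1.{random.randint(0, 99):02d}"
        print(subject, looks_like_viagra(subject))
```

A rule this loose would presumably also catch some innocent words, which is one reason real filters tend to weigh many signals rather than rely on a single pattern.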

One interesting trend in my spam is that it seems to be growing more cosmopolitan. Back in 2003, about 18 percent of the spam I received was written in languages other than English; the figure now is 34 percent. The distribution of languages is curious. Here are the data for May 2008, when I received a total of 933 non-English spams:

spamlangs.png

Does everybody get gobs of spam in Russian, or is it just me? Is there something about my Internet activity that leads mailing-list compilers to believe I read Russian? Well, here’s the sad truth: My knowledge of Russian is so totally lacking that I’m not even sure all those messages are really Russian. They come with a Cyrillic character encoding, but for all I know some of them could be Bulgarian or Ukrainian. I’m equally in the dark about the 153 messages that appear to be written in various Asian languages (Chinese, Japanese, Korean). As for the German messages, they are something of a novelty. Until a few weeks ago, I almost never saw spam in German, and now there’s a sudden spate. It’s pretty clear that all of it comes from the same source. I’m seeing no French spam, nor Portuguese, nor Hindi, Urdu, Arabic, Hebrew.

Linguistic diversity is laudable, and in general I’m pleased to see challenges to Anglophone hegemony. I’m always flattered when someone addresses me in another language—even if I can’t respond in kind. But in this case I’m afraid there’s no reason to be congratulating myself. The spammers are not sending me these multilingual documents because they take me for an accomplished and urbane polyglot. They’re sending them to me (and to millions of others) because selectivity just isn’t worth the bother. Addressees like you and me are too cheap to count. Spam is becoming something like the cosmic microwave background radiation. It’s everywhere, it’s meaningless, it can be mistaken for birdshit.

Update 2008-07-01. More pink meat. I’ve tallied up the receipts for June, and my personal spam volume has set a new record: 3,354 messages, an increase of 20 percent over the previous high of 2,794 in May. The updated graph now covers 18 months:

spamvolume701.png

It’s worrisome to see the quantity growing so fast, but let me try to put the matter in perspective. Alongside the 3,354 spams I received in June, I also received 1,245 nonspam messages. Thus the proportion of spam is about 73 percent—well under the figure of 90 percent that’s often bandied about by companies that sell anti-spam products and services. Moreover, the spam causes me very little actual bother; almost all of it goes directly into the junk folder without need for human intervention. The nonspam messages, on the other hand, demand to be read and responded to. Perhaps I’d get more accomplished if more of my mail were spam.

I have not done a language analysis of the new batch, but I can tell at a glance that I’m still attracting a bizarre glut of Russian spam. A subject line that caught my eye reads:

programspam.png

I can sound out just enough Russian to guess the transliteration “programme spam.” Inside the message is an image of an advertisement (also in Russian) for various warez. But the decoy text that’s meant to get the message through the spam filters is a sports story written in German. Thus even individual messages are now becoming multilingual.

Update 2008-09-01: When I started this thread back in the spring, I thought I was taking note of a step function in the spam rate—a sudden jump from 2,000 a month to a new plateau at 2,500 a month. The trend looks different now: not a series of steps but sustained steady growth, with an increment of roughly 500 a month:

spamvolume901.png

Total spams received in my various inboxes came to 3,886 for July and 4,489 for August.
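The "roughly 500 a month" figure can be checked against the monthly totals quoted in this post and its updates. This is plain arithmetic on those four numbers; the variable names are mine.

```python
# Monthly spam totals as reported in the post and its updates (2008).
monthly_spam = {"May": 2794, "June": 3354, "July": 3886, "August": 4489}

counts = list(monthly_spam.values())

# Month-over-month increase in message count.
increments = [later - earlier for earlier, later in zip(counts, counts[1:])]
average_increment = sum(increments) / len(increments)

# Relative growth from May to June, reported above as 20 percent.
june_growth = (counts[1] - counts[0]) / counts[0]

print(increments)                # [560, 532, 603]
print(round(average_increment))  # 565
print(f"{june_growth:.1%}")      # 20.0%
```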

And I continue to be amazed and baffled by the quantity of Russian-language spam. The proportion of my spam written in a Cyrillic alphabet is now above 40 percent. The growth in Russian-language messages accounts for about two-thirds of the overall increase in the past few months. Should I read some geopolitical meaning into this trend?

On the spot

Saturday, May 24th, 2008

redspot.jpg

Wow. Jupiter has sprouted a third red spot. It was just two years ago that the Great Red Spot was joined by a smaller companion, which was quickly dubbed “Junior.” I guess the new red spot, discovered in the past few weeks, will have to be called “III.”

In the view above, from the Hubble Space Telescope, Junior is southwest of the Great Spot, and the new, smallest member of the family is due west of the big one and a little farther downwind. This is a false-color image, constructed by assigning colors to monochromatic images recorded at three wavelengths, but the intent is to correctly render colors as perceived by the human eye. Evidently none of the spots are really red at the moment. If they were all newly discovered right now, we would have the Great Peach of Jupiter and the Two Little Apricots.

When I get beyond merely admiring the glorious, painterly spectacle of this Jello-chiffon dessert in the sky, what fascinates me most is the time scale of the red spot phenomenon. The Great Spot has been there for at least a century or two, and probably much longer. It is a storm, with rapid counterclockwise circulation clearly visible in the time-lapse photos returned by the Voyager 1 spacecraft in 1979.

Storms are something we can relate to from our earthling experience; we have cyclones here too. But what kind of storm lasts for hundreds of years? Even allowing for the larger spatial scale of events on Jupiter, the Great Spot seems extraordinarily long-lived. The rotation period is roughly one earth-week, which means the spot has survived for something on the order of 10,000 revolutions. And it is geographically stable, too: Although the spot drifts in longitude, it seems to be pinned in latitude, hovering at a swirling boundary between easterly and westerly wind belts.

Very likely, the key to the Great Spot’s longevity is that Jupiter has no continents or other surface irregularities to disrupt the flow of the atmosphere. But that fact makes the uniqueness of the spot somewhat mysterious. If such features can arise spontaneously, purely from the dynamics of the atmospheric flow, like a pearl created without any need for a grain of sand, then why is there just one red spot? You’d think that such storms would develop from time to time wherever conditions were favorable.

And now we have our answer: There’s not just one red spot. But the question of time scales doesn’t entirely go away. It seems implausible that one storm would go on for centuries in lonely splendor, and then suddenly two more would evolve within a couple of years. Perhaps there have been others and we just didn’t notice? Not within the past 50 years, I think. Another possible explanation of this improbable coincidence is that the births of Junior and III are not independent events. All three storms are nearby (at least by Jovian standards) and are surely interacting. If that’s the case, we may not have seen the end of this sequence of events. Will there be more spots? Will they collide or coalesce? Stay tuned.

In the matter of time scales, I can’t help noting that Jupiter has a connection with another epochal event in the modern Internet era. In July of 1994 comet Shoemaker-Levy 9 crashed into Jupiter, and the world followed along via the web. The idea that anyone with a modem could download the images directly from JPL—no waiting for the news media—made quite an impression. The Netscape icon was the apotheosis of this event.

Links:

More third-spot images and explanations of how they were made, from Imke de Pater, UC Berkeley.

Reporting from Science Blog.

Reporting from New Scientist.

A report from the Philippine Daily Inquirer with some background on who first spotted the new spot.

The Wikipedia article on the Great Red Spot (which already has a note on the new one).

Get on board

Tuesday, February 12th, 2008

Ages ago (in blog years) I mentioned some algorithmic ideas for getting passengers aboard airplanes faster, based on a 2005 paper by Steven Skiena and others. Since then, the queue at the departure gate has only gotten longer. Now another preprint on the same theme has landed in the arXiv. This one is by Jason H. Steffen, a postdoc at Fermilab.

Steffen assumes that the main impediment to speedy boarding is the time passengers need for stowing their carry-on luggage. He argues that the loading process will go faster if we make sure everyone has plenty of elbow room for cramming their wheely-bag into the overhead bin. Thus he favors boarding-line sequences generated by the following rule: If two passengers are seated near each other in the aircraft (in the same row or in adjacent rows), then they should not be adjacent in the queue.
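The rule as stated here is easy to test mechanically. The sketch below checks a boarding queue against it and makes no attempt to reproduce Steffen's own procedure for generating sequences. The seat representation (a row number and a seat letter), the function name and the two example orderings are assumptions for illustration.

```python
def conflicts_in_queue(queue):
    """Return pairs of queue neighbors seated in the same or adjacent rows.

    Each passenger is a (row, seat_letter) tuple; an empty result means the
    queue satisfies the rule as described above.
    """
    conflicts = []
    for first, second in zip(queue, queue[1:]):
        if abs(first[0] - second[0]) <= 1:
            conflicts.append((first, second))
    return conflicts

# A toy cabin with six rows and two seats per row.
back_to_front = [(row, seat) for row in range(6, 0, -1) for seat in "AB"]
spaced = [(row, seat) for seat in "AB" for row in (6, 4, 2, 5, 3, 1)]

if __name__ == "__main__":
    print(len(conflicts_in_queue(back_to_front)))  # every neighboring pair conflicts
    print(len(conflicts_in_queue(spaced)))         # no conflicts
```

In this toy cabin, strict back-to-front boarding breaks the rule at every step, while sending every other row keeps queue neighbors at least two rows apart.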

I don’t necessarily agree with Steffen’s premise or his conclusion, but I have no evidence of my own to report, so let’s set that issue aside. I’m intrigued by a related, subsidiary question. If we assume that there is some optimal ordering for passengers as they enter the airplane, how do we organize the unruly and impatient mob at the departure gate so that everyone enters the plane in the specified order? Here are a few ideas.

Boarding Hats. Most airlines now use some kind of zone system, where the passengers are divided into several groups. As boarding time approaches, people mill around near the gate asking, “Have they called Group 2?” or “Is this the line for Group 4?” To achieve finer-grain control over boarding order, we would need larger numbers of smaller groups, which would make the process of finding the right group even more cumbersome. In the limiting case, each passenger would be assigned an individual boarding-sequence number, and the passengers would have to sort themselves into the correct sequence. People are actually rather good at this task, given enough information. When I was a schoolboy, my classmates and I could quickly line ourselves up in order of height, relying on an efficient parallel sorting algorithm. But that method works so well mainly because height is a visible trait. To bring the same efficiency to sorting passengers at the departure gate, the boarding sequence number must be as readily discernible as height. Thus I propose replacing the traditional boarding pass with the boarding hat, which has your number prominently printed on all sides. This innovation would be a special treat for the mathematical community, given the well-known genre of puzzles about mathematicians who can see the number on everyone else’s forehead but not their own.

The Boarding Buzzer. If you think a numbered paper hat is too undignified for airline passengers, here’s another replacement for the boarding pass. When you check in for a flight, suppose you get an electronic gadget like one of those buzzers that restaurants hand to customers who are waiting for a table. As I imagine the boarding buzzer, it has various blinking lights and sound effects and a display screen that counts down the minutes remaining until you are due to report to the gate. The time shown on the display is different for each passenger and is calculated to get everyone on board at just the right moment. When I first thought of this scheme, I dismissed it as preposterous technological excess. But when I told a friend about it, she said I shouldn’t blog it; I should patent it. Obviously I haven’t taken my friend’s advice, and so when I check in at an airport a few years from now and the agent hands me one of these contraptions, I’m going to be mightily annoyed to see that someone else has cashed in on my idea. Still, the device could have certain charms. It would allow you to wander throughout the airport rather than remain tethered to the departure lounge. If a flight were delayed or shifted to a different gate, the airline could notify you right away. Likewise, if you were on standby, you could be paged when a seat opened up.

Out-of-Order Execution. The problem of dealing with suboptimal sequences of events is familiar to designers of computer hardware. Many modern microprocessors analyze the stream of instructions awaiting execution and reorder them to improve throughput. If the next instruction in the stream can’t be executed immediately because its operands aren’t yet available, then maybe some other instruction can take its place. This principle could also be applied to airplane boarding. As passengers enter the jetway, their boarding passes are scanned, and so their actual order in the queue is known from that point forward. Suppose there were a small buffer area at the other end of the jetway, near the aircraft door, along with a display screen where passenger names could be listed. This scheme would allow groups of passengers to be rearranged at the last minute to avoid bottlenecks. The overall boarding order might not be ideal, but it could be locally optimized.

First Come, First Served. A few airlines have abolished seat assignments altogether: You line up at the gate, file onto the airplane, and take any unoccupied seat you choose. In my experience, the boarding process on open-seating flights is generally quick and efficient. But the protocol creates an incentive to be at the head of the queue, and so people start lining up at the gate quite early. Thus the airline gets the benefit of faster loading, but the passengers likely spend more time waiting in line.

The Worst of Both Worlds. Here’s an idea so bad I hesitate even to mention it, lest some airline decide to try it. Suppose all seats are assigned, but you don’t receive your assignment when you book a flight or when you check in at the airport. Instead the assignment is made as you hand in your boarding pass at the gate. This allows the airline to parcel out the seats in whatever order will optimize the boarding process, but it leaves the passengers with little or no control over where they sit. (After long observation you might figure out something about the assignment algorithm and thereby learn where not to stand in line.)

A final thought about luggage: If Steffen is right that carry-on baggage is the main cause of delay, some other tactics might be considered. What if passengers without carry-on bags were allowed to board first? By assumption they would board quickly, relieving congestion for the heavily laden crowd to follow. The policy might also induce some passengers to carry less luggage.

Then there’s the Russian-made aircraft I flew on once, many years ago. You reached the passenger cabin by walking through a lower deck lined with luggage racks, dropping off your bag on the way in and grabbing it on the way out. Probably preferable to wearing a numbered hat or carrying a buzzer.

Last name first

Tuesday, November 20th, 2007

Saturday’s New York Times had a story by Sam Roberts about a newly released Census Bureau study of the frequency of surnames in the U.S. The Times story was mainly about the names at the top of the list, and especially the increasing prominence of Hispanic names (Garcia and Rodriguez have made it into the top ten). But what caught my attention was a passing comment about the bottom of the frequency distribution:

Altogether, the census found six million surnames in the United States. Among those, 151,000 were shared by a hundred or more Americans. Four million were held by only one person.

I was not surprised to learn that the distribution of name frequencies is steeply skewed, with a few common names and a great many rare ones. But could it be true that two-thirds of the names occur just once in the population—that four million people in the U.S. have a unique family name they share with no one else?

Looking through the lens of personal experience, I found it hard to believe those numbers. Over the years I’ve met some people whose family names are surely rare, but I am not aware of a single acquaintance who is the holder of a unique name—if only because everyone I know shares a name with parents or children or siblings or a spouse. After all, family names tend to run in families! To have a unique name, you’ve got to be the first of your line or the last of your line or both.

The study of name distributions has a long history. In the 1870s Francis Galton and Henry William Watson looked into the longevity of family names, concluding:

All the surnames, therefore, tend to extinction in an indefinite time, and this result might have been anticipated generally, for a surname once lost can never be recovered, and there is an additional chance of loss in every successive generation.

The argument sounds good, but it’s not quite as broadly applicable as Galton and Watson thought it was. Extinction is inevitable only in a static or shrinking population. If the population is growing, names and families can become all but immortal. In the 1920s Alfred Lotka calculated that American family names had about an 18 percent chance of surviving indefinitely. More recently, Susanna C. Manrubia, Bernard Derrida and Damián H. Zanette have developed a more refined computer model of name evolution (see arXiv preprint 1 and 2; there’s also a splendid American Scientist article, but annoyingly it’s only accessible to subscribers). Manrubia, Derrida and Zanette describe an equilibrium state where the distribution of names follows a power law. If we define a “clan” as the set of all people who have a surname in common (whether or not they are actually related), then the predicted number of clans of size m is proportional to m^(-β). Manrubia, Derrida and Zanette argue that β = 2. Thus, for example, clans 10 times larger should be 100 times rarer.
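The difference between a shrinking and a growing population can be seen in the standard branching-process calculation that bears Galton and Watson's names: the probability that a line eventually dies out is the smallest non-negative solution of q = f(q), where f is the generating function of the number of name-carrying children per person. The sketch below finds that solution by repeated substitution. The two offspring distributions are invented for illustration; they are not Lotka's figures or anyone else's data.

```python
def extinction_probability(offspring_probs, iterations=10_000):
    """Smallest fixed point of q = sum(p_k * q**k), found by iterating from 0.

    offspring_probs[k] is the probability of having k name-carrying children.
    """
    q = 0.0
    for _ in range(iterations):
        q = sum(p * q**k for k, p in enumerate(offspring_probs))
    return q

# Hypothetical distributions (probabilities of 0, 1, 2, ... children).
shrinking = [0.5, 0.3, 0.2]        # mean 0.7 children: population declines
growing = [0.3, 0.3, 0.25, 0.15]   # mean 1.25 children: population grows

if __name__ == "__main__":
    print(round(extinction_probability(shrinking), 4))
    print(round(extinction_probability(growing), 4))
```

With the shrinking distribution the iteration climbs to 1, so extinction is certain. With the growing one it settles near 0.61, leaving a surname roughly a 39 percent chance of lasting indefinitely under these made-up numbers.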

How do the new Census Bureau findings stack up against these predictions? Here is the frequency table included in the summary report (.pdf):

Table of frequencies of last names

For this data set the cumulative numbers are easier to work with because of the nonuniform bin sizes. Here’s how they look in a graph:

graph of cumulative name frequencies

Graphs of this kind can be confusing. I find it helpful to keep in mind that a point at coordinates x,y indicates there are y clans with x members or more.

If clan frequencies were governed by a strict power law, the graph would trace a straight line on these log-log scales. Overall, the curve is indeed fairly straight, tending to support the power-law model. But a few features of the curve seem to depart from the predictions. For one thing, the slope of the line gives an exponent closer to β = 1 than to the β = 2 that Manrubia, Derrida and Zanette would lead us to expect. I can’t explain that. A steepening of the curve at the large-clan end could be an artifact of finite sample size. Most interesting of all is the sudden uptick at the opposite end of the curve, where clans of size 1 are much more abundant than the power law predicts. On a logarithmic scale it’s easy to misjudge the magnitude of such a trivial-looking excursion: If the two leftmost data points (for clans of size 1 and size 2 through 4) were restored to the trend line of the data from clan sizes of 10 through 1,000, the total number of names in the survey would be about three million instead of six million, and there would be only one million unique names instead of four million.
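One point may be relevant when reading an exponent off a cumulative graph. If the number of clans of exactly size m falls off as m^(-β), then the number of clans of size x or more falls off more slowly, roughly as x^(-(β-1)). The sketch below checks this numerically for an idealized power law. It uses no census data, and the cutoff at one million members is an arbitrary assumption, so it cannot settle whether this accounts for the discrepancy noted above.

```python
import math

def cumulative_count(x, beta, max_size):
    """Relative number of clans with x or more members, for density m**-beta."""
    return sum(m ** -beta for m in range(x, max_size + 1))

def loglog_slope(beta, x_low=10, x_high=1000, max_size=1_000_000):
    """Slope of the cumulative curve between two clan sizes on log-log axes."""
    low = cumulative_count(x_low, beta, max_size)
    high = cumulative_count(x_high, beta, max_size)
    return (math.log(high) - math.log(low)) / (math.log(x_high) - math.log(x_low))

if __name__ == "__main__":
    for beta in (2, 3):
        print(beta, round(loglog_slope(beta), 2))
```

For β = 2 the cumulative slope comes out close to -1, and for β = 3 close to -2. On a plot of clans with x members or more, a slope near -1 would therefore correspond to β near 2 in the underlying distribution.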

I’ll not keep you in suspense any longer about the cause of this anomaly. When I downloaded the Census Bureau report, I found that the authors (David L. Word, Charles D. Coleman, Robert Nunziata and Robert Kominski) are also skeptical about those four million solo monikers. They explain that the data came from census forms on which respondents were asked to print the first, middle and last names of all household residents; the forms were then electronically scanned, and the answers were extracted by optical character recognition. Errors at any point in the process could turn a common name into a unique (but fictitious) one—making a MLLLER out of a MILLER, say. Some of these errors were corrected in later processing, but others apparently slipped through. One particularly troublesome problem arose whenever a respondent printed an entire name in the space intended for the surname. The OCR software simply concatenated all the parts of such a response, leading to spurious surnames such as PETERJDAVIS. The report states that “many” of the four million unique names are products of such data-entry errors, but there is no attempt to quantify the effect.

For privacy reasons, the Census Bureau has released only the 151,671 names (.zip) occurring at least 100 times, so there’s no way to get a look at the unique names. You might think, though, that if three-fourths of them are malformed in some way, that fact would stand out prominently and would have been noticed even before this study was undertaken. You might even think that if 1 percent of respondents are entering names incorrectly, the Census Bureau would have discovered that fact in preliminary testing and would have redesigned the form before circulating it to 300 million people.

Still, I suppose the Bureau’s explanation must be true. There’s spotty suggestive evidence even in the list of names appearing 100 times or more. For example, the list includes surnames such as VANBURKLEO and JOHNSONWILLIAM. And either there are 160 people in the U.S. whose surname is JOHNOSN, or there are 160 JOHNSONs who all made the same transposition error when entering their name on a census form. (Or some combination of the above.)

Even if there are only a million unique names, that still seems like a lot—one out of every 300 people. Galton and Watson looked upon such lonely surnames as dying embers, the last hope of families on the brink of extinction. But some of the rare names are surely newborns rather than expiring elders. Immigration brings names that are new to the U.S. even if they are far from unique globally. And processes akin to mutation and recombination are creating new names all the time. In particular, recombination has become more important now that the purely patrilineal model of name transmission is no longer universal; surnames have broken free from their linkage to the Y chromosome. As a matter of fact, now that I think of it, I was wrong when I said that I have never known a person with a unique surname. I have friends who named their daughter Nina Auslander-Padgham, and her surname surely has a good chance at uniqueness. Or at least it did until Nina’s brother Milo was born.

Out of curiosity, I opened up the Boston-Cambridge phone book, selected a few pages at random, and counted up unique names as a proportion of all names. In a sample of 458 surnames, 254 were listed for one person only, or about 55 percent. This result isn’t too far from the two-thirds ratio in the Census Bureau report, but I’m not sure how to interpret it. The geographic area covered by the Boston directory includes a population of roughly a million, or about 1/300th of the national population. When you select a small sample of this kind—supposing it to be a random sample—what does the selection process do to the frequency distribution of names? If a name occurs 300 times nationally, it could well be unique in Boston, thereby apparently boosting the number of unique names. On the other hand, for every 300 names that truly are unique nationally, only one is likely to be represented in Boston, so in this way the number of unique names is greatly diminished. The question I leave you with is this: How best can we estimate the national (or global) proportion of unique names from a small random sample?
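One way to get a feel for the sampling question is to simulate it. The sketch below builds a made-up population in which the number of clans of size m is about proportional to 1/m², draws each person into a sample with probability 1/300, and compares the share of singleton names before and after. The population size, the clan-size cutoff and the random seed are arbitrary assumptions, so the output illustrates the effect and does not estimate anything about real surnames.

```python
import random
from collections import Counter

def build_population(scale=30_000, max_clan=173):
    """Hypothetical population with about scale / m**2 clans of each size m."""
    people = []
    name_id = 0
    for size in range(1, max_clan + 1):
        for _ in range(scale // (size * size)):
            people.extend([name_id] * size)  # one list entry per person
            name_id += 1
    return people

def unique_fraction(names):
    """Fraction of distinct names that occur exactly once in the list."""
    counts = Counter(names)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

def sample_people(people, fraction, rng):
    """Keep each person independently with the given probability."""
    return [name for name in people if rng.random() < fraction]

if __name__ == "__main__":
    rng = random.Random(2008)
    population = build_population()
    sample = sample_people(population, 1 / 300, rng)
    print("unique in population:", round(unique_fraction(population), 3))
    print("unique in sample:    ", round(unique_fraction(sample), 3))
```

Both effects described above are at work in the simulation: most true singletons vanish from the sample, while members of larger clans turn up alone and pass for singletons.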

V1@gra from the source

Thursday, September 6th, 2007

The last time I was ranting about spam, I inquired of Pfizer, the makers of Viagra, how they filter spam from their own incoming mail stream. They can hardly block all messages that mention their own product. They never got back to me with an answer. Now perhaps I know why. Wired News reports that zombies on Pfizer’s internal network are the source of a recent spam storm!

My summer vacation

Monday, August 27th, 2007

One of the drawbacks of not having a job is that you never get a vacation. Thus the only way I could get away this summer was to take an unpaid leave from blogging. Now I’m back, though—once again ungainfully unemployed. I want to thank all my faithful readers for their forbearance during my absence. I know you missed me, and it was very kind of you—brave, even—to refrain from nagging.

V1@gra

Wednesday, June 13th, 2007

I watched the spelling bee on TV a couple of weeks ago and was stumped by word after word: aniseikonia, oberek, randkluft, cachalot, schuhplattler, cilice. It’s all enough to send you reeling back to Andrew Jackson or Mark Twain or Winston Churchill or whoever the hell it was who said “I don’t give a damn for a man that can only spell a word one way!” As it happens, I’ve been writing lately about words that get spelled and misspelled in lots and lots of ways. My Computing Science column in the July–August issue of American Scientist asks the question: “How many ways can you spell V1@gra?”

Disclaimer: I ask the question but I can’t answer it—or at least I can’t give a definite number or a close approximation.

Another question also goes unanswered. The curious and creative spellings prevalent in spam are (presumably) intended to evade the filters that most of us have installed on our e-mail. Because the word “Viagra” is uncommonly common in spam, most e-mail that mentions it gets dumped in the junk bin. So what do you do if you frequently need to discuss Viagra in your correspondence? In particular, what about Pfizer, the company that makes and markets the stuff? Surely their corporate mail servers can’t be running filters that block all references to their own product.

I tried to find out how Pfizer deals with this problem. I sent an e-mail query to their public relations department. I got no response—which could of course be taken as an answer to my question. I tried following up by telephone, but no one I spoke with was able to shed any light on the issue. So, if anyone from Pfizer should read this, please get in touch; I’d still like to know the answer.

By the way, has anyone noticed that “Pfizer” looks like a spelling invented by a spammer?

Amazon poker

Thursday, May 10th, 2007

Investors are constantly checking the stock ticker, gamblers check the point spread, and everybody is forever checking their e-mail. For a writerly type like me, however, the unshakeable obsession is checking my Amazon sales rank. Amazon.com calculates a sales rank for every book listed on its Web site, and updates the ranking hourly. Here’s a graph of the hourly fluctuations in the ranking of my book Infrastructure: A Field Guide to the Industrial Landscape over the course of a single day last week:

Rankforest graph of Infrastructure sales rank 2007-05-03

At any given moment the current rank is listed in the “Product Details” section of the Amazon page for both the hardcover and the paperback editions. The graph above comes from a service called Rankforest, which also tracks both the hardcover and the paperback versions.

From an author’s point of view, there’s a lot of mystery in these numbers. How are the hourly rankings calculated, and what (if anything) do they mean? How do the rankings correlate with actual sales of the book? Amazon is not telling, and so authors and other interested parties have been left to speculate and experiment. The obvious experiment is to order a copy of the book and observe the effect of this purchase on the ranking. I welcome such experimentation with my book. (I would prefer that you do it with the hardcover edition, for which I get a more generous royalty.)

Morris Rosenthal of Foner Books seems to be the leading scryer of signs in this field. The interpretation of the rankings has also been discussed by Chris Anderson, the editor of Wired, in an article, a blog, and a book (current Amazon rank = 463, way above mine, dammit). It’s clear that the hourly ranking cannot simply reflect the number of copies sold within the past hour. After all, in any given hour the vast majority of books sell zero copies, and so there would be a gigantic tie for last place. The rankings also can’t be based in any simple way on total sales since publication, because the standings are much too volatile. The received wisdom is that each sale produces an uptick, followed by an exponential decay until the next sale.
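The received wisdom can be put into a toy model. In the sketch below a book carries a score that jumps by one with each sale and then decays exponentially, and the rank would come from sorting all books by score. Everything here is hypothetical, including the half-life and the size of the jump; Amazon has not published its method, and this is only one way the "uptick and decay" story could work.

```python
def sale_scores(sale_hours, total_hours, half_life_hours=24.0):
    """Hourly score for one book: +1 per sale, exponential decay in between.

    sale_hours lists the hours (0-based) in which a copy was sold.
    """
    decay = 0.5 ** (1.0 / half_life_hours)  # per-hour decay factor
    sales = set(sale_hours)
    score = 0.0
    series = []
    for hour in range(total_hours):
        score *= decay
        if hour in sales:
            score += 1.0
        series.append(score)
    return series

if __name__ == "__main__":
    # One sale at hour 2, with a one-hour half-life so the decay is easy to see.
    print(sale_scores([2], 6, half_life_hours=1.0))
    # [0.0, 0.0, 1.0, 0.5, 0.25, 0.125]
```

A model of this kind avoids the gigantic tie for last place, since books that sold long ago still differ in their residual scores.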

I find all these speculations fascinating, but they are not what I want to write about today. My topic is even more trivial. I’ve been tracking my Amazon ranking for more than a year and a half now, ever since the hardcover edition was first listed in September 2005. (The paperback came out about a year later.) Here’s what the trend looks like:

Amazon rankings of Infrastructure since publication

The graph records the highest (i.e., numerically smallest) rank noted on each day, with rare gaps when I happened to be offline all day. Without question it would be fairer to use the daily average rather than the daily peak. But an author’s ego is a fragile thing, and so I have gone out of my way to make the outlook as rosy as I could.

Here are some of the actual numbers I recorded, for the month of April 2007:

   date          HB      PB
2007-04-01     138263   47878
2007-04-02     192152   22728
2007-04-03      29146   41862
2007-04-04      73628   57155
2007-04-05      37172   16948
2007-04-06     127858   12779
2007-04-07     171363   25212
2007-04-08      55256   20770
2007-04-09
2007-04-10      36333    7121
2007-04-11     112423   19015
2007-04-12     183063   40015
2007-04-13     225457   46781
2007-04-14     239879   29259
2007-04-15     142252   16030
2007-04-16      24803   39200
2007-04-17      93485   18939
2007-04-18     173691   26434
2007-04-19     217440   19360
2007-04-20      44426   17276
2007-04-21     213765   26652
2007-04-22      39014   21699
2007-04-23     150012   18598
2007-04-24      61301   46268
2007-04-25      33800   22474
2007-04-26      10603   39335
2007-04-27      19746   15984
2007-04-28      14027   19324
2007-04-29      16527   25999
2007-04-30       5844   63439

Notice anything out of the ordinary? What I’m looking at is not the overall pattern of rising and falling magnitudes but rather the inner patterns of digits within the numbers. As I’ve been writing down these rankings over the past 20 months, I have had the persistent impression of a peculiar overabundance of repeated digits. Just in this small sample we have 47878, 22728, 192152, 57155, 25212, 36333, 44426, 33800, 39335, 25999, and lots more. Is there something going on here? Is Jeff Bezos broadcasting secret signals hidden in the digit patterns of the Amazon sales rankings?

I consider myself a reasonably sophisticated probabilist. I know I’m not supposed to be shocked when I bring together 23 people and find that two of them share a birthday. And I know that when people try to generate a random sequence of digits by plucking numbers out of their imagination, the result is almost always too homogeneous, with a deficiency of repetitions and other patterns of the kind I’m calling attention to. Thus I was prepared to believe that the patterns I perceived were conjured up out of nothing—phantom regularities in purely random data. Still, day after day, I would note numbers like 55256 and 20770 (which in fact appeared on the same day, one for the hardcover, one for the paperback), and wonder about the odds of such coincidences. Maybe I wasn’t just letting my imagination run away with me.

Finally, this past weekend, I could stand it no longer; I had to find out.

I’m going to have to give away the punchline right now, lest it come as a disappointment later: There is nothing unusual about those numbers. The distribution of digits—the number of pairs and triples and what-not—is well within the range expected for randomly generated numbers of the same size. In other words, I have confirmed the null hypothesis. Often, such a negative result is considered unpublishable, but this one cost me a fair amount of effort, and so I’m determined to get a blog item out of it for better or worse. If the conclusion itself is not very interesting, perhaps I can find something to say about the techniques and technologies that led to it.

How do you decide whether a number like 36333 is too unusual to be a product of random processes? I decided to analyze the numbers in my sample as if they were poker hands (with the rules of poker adapted to a deck of cards with ten ranks and no suits). I confined my attention to the five-digit numbers, which are the most numerous in my sample; of the 850 rankings I had recorded, 458 were in the range between 10,000 and 99,999. Then I wrote a five-line program that takes any such number and classifies it as one of seven types of poker hands:

  • five of a kind (e.g., 77777)
  • four of a kind (36333, 11119)
  • full house (44555, 28288)
  • three of a kind (57155, 20900)
  • two pairs (97097, 28002)
  • one pair (36739, 14912)
  • bust (53208, 16897)

I decided to ignore straights; the poker hand 56789 is simply a “bust” according to this scheme of classification. The concept of a flush—all cards of the same suit—doesn’t arise, since there are no suits.
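For readers who want to try the classification themselves, here is a minimal sketch in Python. It is not the original five-line program, which is not shown in the post; the names `HANDS` and `classify` are made up for illustration. The idea is simply to count how often each digit occurs and look up the resulting pattern of multiplicities.

```python
from collections import Counter

# Hand names keyed by the digit multiplicities, largest first.
# (Illustrative table; straights are deliberately not recognized.)
HANDS = {
    (5,): "five of a kind",
    (4, 1): "four of a kind",
    (3, 2): "full house",
    (3, 1, 1): "three of a kind",
    (2, 2, 1): "two pairs",
    (2, 1, 1, 1): "one pair",
    (1, 1, 1, 1, 1): "bust",
}

def classify(rank: int) -> str:
    """Classify a five-digit ranking as a suitless poker hand."""
    if not 10_000 <= rank <= 99_999:
        raise ValueError("expected a five-digit number with no leading zero")
    shape = tuple(sorted(Counter(str(rank)).values(), reverse=True))
    return HANDS[shape]

# The first example of each hand type from the list above.
for n in (77777, 36333, 44555, 57155, 97097, 36739, 53208):
    print(n, classify(n))
```

Because only the multiplicities matter, a run such as 56789 falls through to "bust", in keeping with the decision to ignore straights.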

When I ran my little program over the 458 five-digit Amazon sales rankings, here’s what I got:

   hand            number       frequency

five of a kind        0          0.0000
four of a kind        3          0.0066
full house            7          0.0153
three of a kind      34          0.0742
two pairs            51          0.1114
one pair            233          0.5087
bust                130          0.2838

   TOTAL            458          1.0000

The question now, of course, is what I should expect to see in such a data set, on the hypothesis that the digit patterns are random rather than contrived for some secret, nefarious purpose. Should I find it remarkable that half of the hands have a single pair, or that more than 70 percent have a pair or better? What about those seven full-house hands—should that abundance arouse suspicion? To begin answering questions like these, we need to calculate the probabilities of the various hands.

Having just boasted of my sophistication as a probabilist, I must now confess that the main thing I’ve learned about calculating probabilities is that getting the right answer is highly improbable. I understand the principle of the thing. It’s just a matter of counting. You count the “success” cases and divide by the total number of cases. But counting—even though it tends to come early in the mathematical curriculum—is not always easy.

In the Amazon poker problem, the first trap for the unwary is in counting the total number of cases. You might think that with five decimal digits, there would be 10^5 possible arrangements, but in fact 10^4 of those arrangements are excluded from the sample, because numbers in this context cannot have a leading digit of zero. Thus the denominator in the probability calculation will be 90,000 rather than 100,000.

I think I can trust myself to calculate the probability of a five-of-a-kind hand. The first card dealt must not be a zero but can be any of the other nine digits; thereafter, each subsequent digit must be identical to the first one. Thus the number of ways of forming a five-of-a-kind hand is 9×1×1×1×1, and the probability of this outcome is 9/90,000, or 0.0001. You can expect five of a kind in one hand out of every 10,000, if the deal is fair.

I believe this answer is correct, and I am proud of having obtained it; on the other hand, the five-of-a-kind calculation is by far the easiest case. Allow me to try to work out a harder problem: the odds of a full house. I invite you to listen in on my so-called thought process:

Well, a full house is a hand that matches the pattern aaabb. The first digit can be anything but zero, and so there are nine possibilities, but then the second and third digits have to match the first. The fourth digit must differ from the first three, and so there are eight candidates left…. No, wait…. This time zero is allowed, and so there are nine possibilities again. Then the fifth digit has to be identical to the fourth. That gives us the product 9×1×1×9×1, or 81 successes out of 90,000 total cases, for a probability of 0.0009…. Did I get that right?… Of course not. What I’ve calculated is the probability of seeing the pattern aaabb in that precise sequence; I am counting occurrences of 11100, 11122, 11133, …, 99988, but I am not including sequences such as 23233 or 45454. Okay. So one approach, now that I have the number of aaabb sequences, is to multiply by the number of ways of permuting that sequence. We can just take the aaa and intercalate the two b’s in all possible positions: bbaaa, babaa, baaba,…. No, wait…. The b could be a zero, and so it’s not always allowed to appear in the first position. Hmmm. This is not going well. For the time being, let’s forget about the prohibition of leading zeros, and we’ll correct for it later. That way we can consider all possible permutations of aaabb. As a first step, take aaa, where a can be any of the ten decimal digits, and place a b in all possible positions, allowing b to assume any value that differs from a, so that there are nine choices. There are four places to put the b: baaa, abaa, aaba and aaab. Now, for each of these four sequences, the second b can be placed in any of five positions, and so there are 4×5 = 20 permutations overall. Which means that the total number of full houses is…. Hold on. No, no, no, no, no. Not all those 20 permutations are distinguishable.
If we start with baaa and insert a b in either the first or the second slot, we wind up with bbaaa in either case; this sequence should not be counted twice. So how many permutations are there, really? Offhand, I don’t see any way to count them that’s easier than direct enumeration: bbaaa, babaa, baaba, baaab, abbaa, ababa, abaab, aabba, aabab, aaabb. That’s ten cases. Let’s sum up. We know that a can take any of ten values and b has nine possible values, and there are ten ways of arranging three a’s and two b’s. Thus the total number of combinations is 10×9×10 = 900. But, don’t forget, now we have to subtract away all those sequences that start with a zero. For this purpose it doesn’t matter whether the first symbol is an a or a b; exactly 10 percent of the sequences will have a leading zero. Thus the number of full houses is 900–90=810. This gives a probability of 810/90,000 = 0.009.

I happen to know, from an independent calculation, that this answer is correct, and so it’s time to stop. If I didn’t know, however, I might well go on to ask whether we need to interchange the a’s and b’s—that is, consider the case of sequences aaabb where a has only nine allowed values and b has ten. (Why don’t we have to take that into account?)

I am mildly embarrassed to put this lurching, stumbling, caricature of a probability calculation on public exhibition, and yet it is a fair description of how I often struggle with a problem of this kind. Am I the only one who suffers so? Not everyone does. I know people who could carry out the same computation quite deftly; they command the intuition, the spürkraft, to zero in immediately on the right approach, like a chess player who doesn’t waste time considering fruitless moves. I admire that kind of finesse, but I don’t possess it.

On the other hand, I know something else. I know another way to solve the problem, with less fuss. As I mentioned above, I have already cooked up a little program that can take any five-digit Amazon rank and classify it into one of the seven categories of poker hands. In milliseconds I can run that program on all possible five-digit numbers; after all, there are only 90,000 of them, and it’s quite easy to generate all of them in sequence. Then I can just count the number of hands in each category, and all the probabilities come tumbling out in a neatly formatted table:

   hand            number       frequency

five of a kind        9          0.0001
four of a kind      405          0.0045
full house          810          0.0090
three of a kind    6480          0.0720
two pairs          9720          0.1080
one pair          45360          0.5040
bust              27216          0.3024

   TOTAL          90000          1.0000
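The brute-force census is easy to reproduce. The following sketch, again in Python with illustrative names (`HANDS`, `hand`, `census`) that are not from the original program, classifies every number from 10000 through 99999 and tallies the results.

```python
from collections import Counter

# Hand names keyed by digit multiplicities, largest first (illustrative).
HANDS = {
    (5,): "five of a kind",
    (4, 1): "four of a kind",
    (3, 2): "full house",
    (3, 1, 1): "three of a kind",
    (2, 2, 1): "two pairs",
    (2, 1, 1, 1): "one pair",
    (1, 1, 1, 1, 1): "bust",
}

def hand(n: int) -> str:
    """Name the poker hand formed by the digits of n."""
    shape = tuple(sorted(Counter(str(n)).values(), reverse=True))
    return HANDS[shape]

# Every five-digit number without a leading zero: 10000 through 99999.
census = Counter(hand(n) for n in range(10_000, 100_000))
total = sum(census.values())

for name in HANDS.values():
    print(f"{name:<18}{census[name]:>6}{census[name] / total:>12.4f}")
print(f"{'TOTAL':<18}{total:>6}{1.0:>12.4f}")
```

The counts can be cross-checked against simple cases done by hand: a bust, for instance, needs five different digits with a nonzero leader, which gives 9×9×8×7×6 = 27,216.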

If Fermat or Pascal or the Bernoullis were asked their opinion on this approach to probability, something tells me they would find it distasteful. I’m ambivalent myself. Returning to the chess analogy, this is the equivalent of the machine that beats a grandmaster by brute force, scanning millions of positions but knowing nothing of strategy. You can win that way, but you don’t get any style points. More important, the method is frustratingly opaque. I get the answers, and I have reasonable confidence that they’re correct, but the computation gives me no understanding of why they’re correct. Still another objection is that the method does not scale well. If my Amazon rankings had ten digits instead of five (perish the thought!), I’d have a hard time classifying the nine billion possible hands. (But then again I’m not sure the more analytic method scales all that well either.)

Setting aside these qualms, we can now take the predictions of theory and the observations of the Amazon sample and compare them side by side:

   hand            prediction       observation

five of a kind        0.0001          0.0000
four of a kind        0.0045          0.0066
full house            0.0090          0.0153
three of a kind       0.0720          0.0742
two pairs             0.1080          0.1114
one pair              0.5040          0.5087
bust                  0.3024          0.2838

   TOTAL              1.0000          1.0000

Some of these frequencies match quite closely (three of a kind, one pair); others are a bit off the mark (full house, bust). In a finite sample, of course, you would never expect an exact match to the theoretical frequencies—but are the discrepancies we’re seeing significant or not? My personal instinct says that there’s nothing amiss here, that the observed frequencies are consistent with the null hypothesis. In other words, the rankings could just as well be random numbers. The old rule of thumb that the variation should be less than the square root of the observation leads to the same conclusion. We could quantify these intuitions by calculating variances or standard deviations, doing a χ^2 test, and so on. But if I can barely calculate a simple probability, can I be trusted to navigate all those treacherous subtleties such as choosing the correct number of degrees of freedom?

Again there’s another way to go about it, relying on lots of ignorant computation to replace a little smart mathematics. The question we want to answer is this: Given a set of 458 randomly generated five-digit numbers, what is the probability that the random set will differ from the predicted frequencies by at least as much as the observed Amazon set? Suppose we measure distance from the theoretical prediction in terms of the sum of the squared differences:

S^2=\sum_{i=1}^7(y_i-\bar{y}_i)^2

where the index i ranges over the seven types of poker hand, and the expression in parentheses is the difference between the observed and predicted frequency for each type of hand. (Technical note: The seven numbers entering into this statistic are not independent. For example, if full-house hands are in surfeit, there has to be a compensating deficiency somewhere else. How much should I worry about this?) For the actual Amazon results, S^2 works out to 4.267×10^–4. Now we can generate lots of batches of random Amazon poker hands, each batch consisting of 458 five-digit numbers, and calculate S^2 for each batch. What proportion of them will have an S^2 value exceeding 4.267×10^–4? The answer, based on half a million batches, is 80 percent. Thus all the anomalies I thought I was seeing in those numbers are pure delusion.
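A scaled-down version of this experiment might look like the sketch below. It is an assumption-laden reconstruction, not the original code: the function names, the fixed seed, and the choice of 2,000 batches (rather than half a million) are all arbitrary, so the estimated proportion will wobble by a percentage point or two around the figure quoted above.

```python
import random
from collections import Counter

# Digit-multiplicity patterns for the seven hands, five of a kind first.
SHAPES = [(5,), (4, 1), (3, 2), (3, 1, 1), (2, 2, 1), (2, 1, 1, 1), (1, 1, 1, 1, 1)]

def hand_shape(n):
    """Sorted digit multiplicities, e.g. 36333 -> (4, 1)."""
    return tuple(sorted(Counter(str(n)).values(), reverse=True))

# Hand type of every five-digit number, computed once and reused.
TABLE = [hand_shape(n) for n in range(10_000, 100_000)]
tally = Counter(TABLE)
PREDICTED = [tally[s] / len(TABLE) for s in SHAPES]

def s_squared(counts, n):
    """Sum of squared differences between observed and predicted frequencies."""
    return sum((c / n - p) ** 2 for c, p in zip(counts, PREDICTED))

OBSERVED = [0, 3, 7, 34, 51, 233, 130]  # counts from the Amazon sample
N = sum(OBSERVED)                       # 458 rankings
s2_obs = s_squared(OBSERVED, N)

def simulate(batches, seed=2007):
    """Fraction of random batches whose S^2 exceeds the observed value."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(batches):
        # Drawing from TABLE is the same as drawing uniform five-digit numbers.
        batch = Counter(rng.choices(TABLE, k=N))
        if s_squared([batch[s] for s in SHAPES], N) > s2_obs:
            exceed += 1
    return exceed / batches

p_value = simulate(2_000)
print(f"S^2 observed = {s2_obs:.4e}")
print(f"fraction of random batches with larger S^2 = {p_value:.3f}")
```

Sampling from the precomputed table avoids reclassifying each random number, which keeps even a pure-Python run down to a second or so.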

Needless to say, I was rooting for another outcome. I would have enjoyed finding something spooky and inexplicable in the Amazon rankings. Instead, all I’ve proved is that my intuition about what random numbers look like is not to be trusted. This is a disappointment, but maybe I can salvage something from the ruins. Perhaps I can claim the discovery that the Amazon rankings are a fairly good source of random numbers.

Large-scale differences and movements in the rankings are surely not random. It’s not purely a matter of chance that my book’s current rank is 43,888 (what an interesting number!), while some preposterous tale about a pubescent wizard occupies the top of the list—even though that book hasn’t actually been published yet. (Do I sound bitter?) The rankings are nonrandom in another way as well: The first digits are not uniformly distributed but have a Benford or Zipf distribution, with an excess of ones and a shortage of nines. It’s interesting that even though the randomly generated numbers do not share this property—the first digits have a uniform distribution over the range 1 through 9—the poker-hand analysis shows that the frequency of pairs, triples, and other patterns is identical in the two data sets. Thus the digits to the right of the first digit do seem to have the statistical properties of random numbers. The source of the randomness is presumably the hourly reshuffling of the rankings by the actions of thousands of Amazon shoppers—actions that are all too predictable in the aggregate (curse you, Harry Potter) but quite random in detail.

Perhaps it’s worth noting that when the RAND Corporation prepared their famous book A Million Random Digits with 100,000 Normal Deviates in 1955, they also employed the poker test as a measure of randomness. I’m relieved to find that their calculation of the theoretical frequency of the seven types of hands agrees with mine. (They don’t say how they performed the calculation.) The current Amazon sales rank for the book is 2,668,928 (hardcover) and 281,270 (paperback).

Hermann Weyl, tax accountant

Monday, April 16th, 2007

It’s tax time for Usaians. I’ve been plodding through the thick book of forms and instructions, tips and cautions, tables and worksheets and schedules and Paperwork Reduction Act notices. The unwelcome annual ritual always reminds me of the words of Hermann Weyl:

Our federal income tax law defines the tax y to be paid in terms of the income x; it does so in a clumsy enough way by pasting several linear functions together, each valid in another interval or bracket of income. An archeologist who, five thousand years from now, shall unearth some of our income tax returns together with relics of engineering works and mathematical books, will probably date them a couple of centuries earlier, certainly before Galileo and Vieta.

We should forgive Weyl his peevish tone; Form 1040 can put anyone in a grumpy mood. And maybe he had a point. The tax code has changed in many ways since Weyl gave his contemptuous assessment in 1940, but after all these years the income tax is still defined by a piecewise linear function. Here are the current rates for single taxpayers:

2006 single tax rate chart

Graphically, the tax y as a function of taxable income x looks like this:

piecewise linear tax function

A few notes on this income-tax function:

  • It is a proper function in the mathematical sense of that term: Every income x ≥ 0 is mapped to a unique tax y.
  • The domain of the function excludes negative incomes. In the real world, it’s quite possible to finish the year with a loss, but in tax land we are instructed: “Subtract line 42 from line 41. If line 42 is more than line 41, enter -0-.”
  • The function touches the origin (zero income means zero tax) and is monotonically increasing. There is no income level where you can earn more money and pay less tax.
  • The function is concave upward, which implies that the tax is progressive: People with higher incomes pay a higher proportion of their income as tax. But the progression stops at 35 percent: As x approaches infinity, the ratio y/x approaches a limiting value of 0.35.
  • As a function defined by “pasting together” several straight-line segments, it is continuous but not differentiable at the points where the segments meet.
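The piecewise definition is short enough to write out. Since the rate chart above is an image, the thresholds and rates in this sketch are filled in from the 2006 schedule for single filers and should be read as an assumption; they do reproduce the bracket points quoted elsewhere in this post, such as {30650, 4220} and {336550, 97653}. The names `BRACKETS` and `tax` are illustrative.

```python
# Assumed 2006 schedule for single filers:
# (taxable income where the bracket starts, marginal rate in that bracket).
BRACKETS = [
    (0, 0.10),
    (7_550, 0.15),
    (30_650, 0.25),
    (74_200, 0.28),
    (154_800, 0.33),
    (336_550, 0.35),
]

def tax(income: float) -> float:
    """Piecewise linear tax: each slice of income is taxed at its bracket's rate."""
    income = max(income, 0.0)  # "If line 42 is more than line 41, enter -0-."
    total = 0.0
    for i, (lower, rate) in enumerate(BRACKETS):
        upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else float("inf")
        if income > lower:
            total += (min(income, upper) - lower) * rate
    return total

for x in (0, 7_550, 30_650, 336_550, 10_000_000):
    share = tax(x) / x if x else 0.0
    print(f"{x:>12,} {tax(x):>14,.2f} {share:>8.4f}")
```

The last column is the ratio y/x, which creeps toward 0.35 as the income grows.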

Weyl’s complaint against the tax code might have had something to do with the last of these observations—the jaggedness of the curve, or the discontinuity of the first derivative—but I don’t think that gets to the heart of the matter. No one really cares whether or not the tax curve is twice differentiable. What bothered Weyl, I suspect, was the mere fact that the tax function is defined as a concatenation of segments. He wanted a curve defined by a single expression that could be evaluated in the same way throughout the range of the function. (By the way, does this concept have a name, other than “not piecewise”? Should we call it a piecefoolish function?)

In setting out to create such a function, an obvious approach is fitting a polynomial to the half-dozen points given in the tax-rate table. But this process is not quite as easy and straightforward as it might seem. A major problem is that there are really seven points in the table rather than six. The 35-percent bracket continues indefinitely, extending to arbitrarily large incomes, and so there’s really an additional point on the curve, somewhere out there where x = ∞ and y = 0.35x. If you ignore this extra point, the 35-percent bracket disappears entirely. If you try to place the point at infinity, its influence on a least-squares fit will overwhelm all the other points. I don’t know a good solution to this problem, so I’ve adopted an arbitrary one. Noting that the lengths of the tax brackets are roughly in geometric proportion, I’ve placed a seventh point at a position that maintains this approximate relation, x = 750,000, y = 242,360.50.

Let’s begin with a linear (i.e., first-degree) polynomial model of the tax brackets. Some people call this a “flat tax,” although that term seems to me more appropriate for a scheme in which everyone pays the same amount of tax; here, it means everyone pays at the same rate, or the same proportion of income. In any case, I don’t think the flat-tax fans would be enthusiastic about this flat tax:

graph of a linear fit to tax brackets

Following this formula, everyone with a taxable income below about $19,000 pays a negative tax, or in other words receives a subsidy from the government. For those with zero income, the subsidy is more than $6,000. Meanwhile, all those above the $19,000 threshold pay 32.6 percent of every dollar beyond it as tax.
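The straight-line fit can be checked with a few lines of ordinary least squares. The seven data points below are an assumption: six bracket corners taken from the 2006 single-filer schedule (the chart itself is an image) plus the extra point at x = 750,000 described above. The function name `linear_fit` is illustrative.

```python
# Assumed data: corners of the 2006 single-filer brackets, plus the
# extra seventh point at x = 750,000 (income, tax).
POINTS = [
    (0, 0.0),
    (7_550, 755.0),
    (30_650, 4_220.0),
    (74_200, 15_107.5),
    (154_800, 37_675.5),
    (336_550, 97_653.0),
    (750_000, 242_360.5),
]

def linear_fit(points):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = linear_fit(POINTS)
break_even = -intercept / slope  # income at which the fitted tax is zero

print(f"slope      = {slope:.4f}")
print(f"intercept  = {intercept:,.0f}")
print(f"break-even = {break_even:,.0f}")
```

The intercept is the tax at zero income, so its negative is the size of the subsidy at the bottom of the scale.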

A quadratic fit to the tax-table data looks slightly less inflammatory:

graph of quadratic fit to tax table

There’s still a bit of negative income tax, but it doesn’t kick in until income falls below $9,500, and the maximum subsidy is about $2,500. Overall, the curve is really quite a close fit to the data, with an r^2 value of 0.9995. The largest residual error between the fitted curve and the data is $2,484, at the zero-income point. Perhaps one could use these linear and quadratic approximations to the tax-rate data to argue that the shape of the tax function “wants” to include a negative tax for the lowest incomes, but the official curve has been artificially cut off to prevent this. (The Earned Income Credit does allow some low-income taxpayers to have an effective negative tax, but the structure of the credit is different from that of the models discussed here. The EIC provides almost no benefit at zero income; the maximum negative tax is at an income of between $6,000 and $16,000.)

We can match the data points in the tax table as closely as we please simply by going to higher-degree polynomials, but this is not necessarily a good idea. A sixth-degree polynomial will thread itself through all seven of the specified points:

sixth-degree polynomial fit of tax-rate data

In between the specified points, the curve has some suspicious-looking lumps and sags. People earning about $250,000 seem to be getting a break, and those with incomes of $350,000 are paying a penalty. On taking a step back and looking at a broader range of incomes, the curve turns out to be far more bizarre:

graph of sixth-degree polynomial extrapolation

This is a tax function that Vice President Cheney might well appreciate. The Cheney family has reported that they had taxable income of $1.6 million last year; according to the sixth-degree polynomial, they should be due a refund—a negative income tax—of a little over $2 billion. Meanwhile, poor George W. Bush, who reported a meagre $642,905 of taxable income, is unlucky enough to find himself near the peak of that big hump in the curve. He would be asked to pay a tax of more than $1.5 million—almost three times his total earnings.
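Anyone curious about how wild the extrapolation gets can evaluate the interpolating polynomial directly. The sketch below uses the Lagrange form with exact rational arithmetic; the seven points are an assumption (2006 single-filer bracket corners plus the extra point at x = 750,000), and `interpolate` is an illustrative name, not the method used to draw the graphs.

```python
from fractions import Fraction

# Assumed data: 2006 single-filer bracket corners plus the seventh point.
POINTS = [
    (0, Fraction(0)),
    (7_550, Fraction(755)),
    (30_650, Fraction(4_220)),
    (74_200, Fraction("15107.5")),
    (154_800, Fraction("37675.5")),
    (336_550, Fraction(97_653)),
    (750_000, Fraction("242360.5")),
]

def interpolate(points, x):
    """Value at x of the unique polynomial of degree len(points) - 1
    passing through all the points (Lagrange form, exact arithmetic)."""
    x = Fraction(x)
    total = Fraction(0)
    for j, (xj, yj) in enumerate(points):
        term = yj
        for i, (xi, _) in enumerate(points):
            if i != j:
                term *= (x - xi) / (xj - xi)
        total += term
    return total

# Incomes mentioned in the text, inside and outside the fitted range.
for income in (250_000, 350_000, 642_905, 1_600_000):
    print(f"{income:>10,} {float(interpolate(POINTS, income)):>20,.0f}")
```

Exact fractions are used because the terms of a sixth-degree polynomial at x in the millions are enormous and nearly cancel, which is unkind to floating-point arithmetic.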

So maybe fitting a curve to the existing tax structure is not such a good idea after all. Let’s try to construct a curve that’s similar in overall form to the present tax brackets but not so rigidly constrained by the data in the table. The curve in the graph below is a quadratic that passes through three of the bracket-defining points, {0, 0}, {30650, 4220} and {336550, 97653}:

three-point quadratic-fit curve

Over the range of incomes shown, the shape of the curve is a reasonable match (at least by eye) to the piecewise tax function of the Internal Revenue Service. By construction, the function yields a zero tax for zero income and is positive everywhere else. It is concave upward. But there’s still a big problem:

quadratic fit to three points, extrapolation

The quadratic curve was constructed from data in the range between zero and a few hundred thousand dollars; outside that range, the function is free to go wild. In this case the ratio y/x becomes greater than 1 at about x = $1.8 million; beyond that income level, the average tax rate is greater than 100 percent, so that the tax owed exceeds earnings. (The slope of the curve, which is the marginal rate, passes 1 even sooner, at half that income.) Even those of us who ardently want to soak the rich will have to admit that collecting such taxes might be difficult.
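The runaway behavior is easy to pin down numerically. Because the curve passes through the origin it has the form y = a·x² + b·x, and the two remaining points determine a and b. The sketch below (function and variable names are illustrative) solves for the coefficients and then locates two thresholds: where the average rate y/x reaches 1 and where the slope dy/dx reaches 1.

```python
def quadratic_through_origin(p1, p2):
    """Coefficients (a, b) of y = a*x**2 + b*x through (0, 0), p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    # Dividing y = a*x**2 + b*x by x gives the line y/x = a*x + b.
    a = (y2 / x2 - y1 / x1) / (x2 - x1)
    b = y1 / x1 - a * x1
    return a, b

a, b = quadratic_through_origin((30_650, 4_220), (336_550, 97_653))

def tax(x):
    return a * x * x + b * x

# Average rate y/x = a*x + b equals 1: the tax consumes the whole income.
tax_equals_income = (1 - b) / a
# Marginal rate dy/dx = 2*a*x + b equals 1: each extra dollar is taxed away.
marginal_rate_100 = (1 - b) / (2 * a)

print(f"a = {a:.4e}, b = {b:.4f}")
print(f"tax equals income at   x = {tax_equals_income:,.0f}")
print(f"marginal rate hits 100% at x = {marginal_rate_100:,.0f}")
```

For a parabola through the origin the second threshold is always exactly half the first, since the slope 2ax + b grows twice as fast as the ratio ax + b.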

It begins to seem that Weyl’s request for a simple tax function is not so easy to satisfy. We can have a piecewise definition and make it do anything we want, but that is what Weyl was trying to get away from. We can have a flat tax, given by a first-degree polynomial, but many people think that would be socially and economically undesirable. With a polynomial of any degree higher than 1, the tax will eventually diverge either to +∞ or –∞.

On the other hand, we have certainly not exhausted the list of candidate functions. How about this one: y = x – x^(1–ε), where ε is some positive number less than 1? Here’s what the tax curve looks like for ε = 0.027.

graph of y=x-x^(1-epsilon)

This is a fairly good match to the existing tax function over the range of x shown, and it doesn’t blow up at larger values of x. As x goes to infinity, y/x very slowly approaches 1 from below. (The tax rate reaches 90 percent for incomes above about 10^37 dollars.) The function is concave upward, and it’s positive for all x greater than 1.
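A few lines of Python are enough to explore this curve. In the sketch below the function names are illustrative; the only inputs taken from the text are the formula y = x – x^(1–ε) and the value ε = 0.027. Since y/x = 1 – x^(–ε), the income at which the rate reaches 90 percent satisfies x^(–ε) = 0.1, which means x = 10^(1/ε).

```python
EPSILON = 0.027

def tax(x, eps=EPSILON):
    """Candidate tax y = x - x**(1 - eps) on an income of x dollars."""
    return x - x ** (1 - eps)

def average_rate(x, eps=EPSILON):
    """The ratio y/x, which simplifies to 1 - x**(-eps) for x > 0."""
    return 1 - x ** (-eps)

# Average rates at a few incomes, including two bracket-defining points.
for income in (30_650, 336_550, 1e6, 1e9, 1e12):
    print(f"{income:>20,.0f} {average_rate(income):>8.4f}")

# Exponent of ten at which the average rate reaches 90 percent.
log10_income_at_90 = 1 / EPSILON
print(f"rate reaches 0.90 near 10^{log10_income_at_90:.1f} dollars")
```

The match to the existing schedule is much better at the upper end of the graphed range than at the lower end, where the power law runs noticeably high.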

In this formula the expression x^(1–ε) could be replaced by any slowly growing function of x. An appealing candidate is log(x), but getting a sensible tax curve out of a logarithmic function is a bit of a chore. The first problem appears at the origin: log(x) diverges to negative infinity as x approaches zero from above, and so a small income will earn you an arbitrarily large negative tax. We can avoid the problem by working with log(x+1). A further difficulty is that log(x) grows so very slowly that over any wide range of incomes, y = x – log(x) doesn’t differ appreciably from y = x. The closest I’ve been able to come to an acceptable tax function is 0.335 x – 973 log(x + 1) + 1578:

graphxminuslogx

Here the tax is negative for any income below about $25,000.

The function x – x/log(x), shown below, is somewhat better-behaved, although again fudge factors are needed to avoid singularities near the origin. (The actual equation being graphed here is –7.706(x+1)/log(x+2) + 0.893x – 2.297.)

graph of f(x) = x/log(x)

The mention of x/log(x) leads to a further thought, and a final fantasy. The function x/log(x) is well known as an approximation to π(x), the function that counts the number of primes less than x. I think Weyl might have been pleased if the instructions to Form 1040 required you to enumerate the prime numbers up to your income level. This innovation would at last bring the federal tax code out of the age of Galileo and Vieta, all the way up to Gauss and Riemann.

Postage due

Thursday, February 8th, 2007

There was a line at the Post Office window, so I went to the self-service counter, plopped my letter on the scale, and found that it weighed a whisker under two ounces. I bought stamps from the machine and stuck on a 39-cent and a 24-cent. I was just about to drop the letter in the slot when a thought struck me. I went back to the scale. Sure enough: With the stamps affixed, I was over the two-ounce limit.

I’m not going to tell you what I did next—whether or not I put an extra stamp on the envelope. That’s between me and my postmaster, and until they repeal the Fifth Amendment I have nothing more to say about it. But I will concede that my conscience may have been troubling me, because last night I dreamed of postal reform.

In my dream, the nation finally scraps the whole bizarre congeries of ad hoc step functions that currently define U.S. postage rates. Postage becomes a continuous function of a letter’s weight. (The current rate structure for domestic first-class mail appears to be a feeble attempt to approximate a simple linear function: P = 24W + 15, with the weight W in ounces and the postage P in cents.)

In the new regime we also dispense with the baffling collection of arbitrary stamp denominations. (Currently on sale at shop.usps.com: 1, 2, 3, 4, 5, 10, 23, 24, 37, 39, 48, 60, 63, 70, 75, 83, 84, 87, 100, 385, 405, 500, 1440. (It’s not in the sequence server, and please don’t put it there.)) Sweeping away all this cruft, my dream Post Office sells postage in continuous strips and sheets with a defined value per unit area. You cut off a piece of the stuff—we can call it postage-tape, or maybe stampage—exactly as large as you need to pay the tariff on a letter of any given weight.

Better still, instead of measuring postage by the area of the stamp, we can measure it by the weight of the stampage stuff. The marvelous thing about this scheme is that the postage rate becomes a dimensionless quantity. Whether you express it in grams per gram or ounces per ounce, it comes out the same. The rate is a pure number. Let’s suppose it’s r = 1/10, just so we have something definite to talk about.

Now, when I take my letter to the Post Office, if it weighs, say, 50 grams, I know that I have to apply 5 grams of postage.

But wait. Now the letter-plus-stampage weighs 55 grams, and so the correct postage is 5.5 grams. When I add another half-gram of stamp stuff, the new weight is 55.5 grams, and the correct postage amount is 5.55 grams….

You may think that this endless series of adjustments to adjustments is a drawback of my new postal pricing model. Au contraire! It is the principal advantage. The benefits extend far beyond the Postal Service and promise to transform American life and culture, and especially education.

There is a scene that plays out every day in classrooms all across the country—or so I’m told. A high school kid, bored with a lesson on the summation of series, protests bitterly: “Why do I need to know this stuff? No way am I ever going to sum an infinite series in real life.” Now we have an answer for that young nihilist. Do you want to stand in the Post Office all day with cuticle scissors, cutting ever-smaller slivers of tape as you try to approximate the postage due on a letter? Or do you want to learn once and for all that

\sum_{n=1}^\infty r^n = \frac{r}{1-r}

Here is America’s last best chance to be taken seriously as an educated and cultivated society. Nobody’s going to mess with a country where you need to know a little calculus just to mail a letter.
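For the student who remains unconvinced, both approaches can be tried side by side. This sketch assumes the rate r = 1/10 and the 50-gram letter from the example above; the function names and the cutoff for the smallest sliver are made up for illustration.

```python
def postage_by_trimming(letter_grams, rate, tolerance=1e-12):
    """Keep adding stampage to cover the stampage just added,
    stopping when the next sliver is too small to cut."""
    if not 0 <= rate < 1:
        raise ValueError("this sketch assumes 0 <= rate < 1")
    total = 0.0
    sliver = letter_grams * rate  # postage on the bare letter
    while sliver > tolerance:
        total += sliver
        sliver *= rate            # postage owed on the sliver just added
    return total

def postage_closed_form(letter_grams, rate):
    """Sum of the geometric series: weight times r / (1 - r)."""
    return letter_grams * rate / (1 - rate)

print(postage_by_trimming(50, 0.1))
print(postage_closed_form(50, 0.1))
```

Both routes arrive at 50/9 grams of stampage, a little over five and a half grams, but only one of them requires scissors.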

Addendum. Toward morning my dream took a darker turn. What if postal rates keep rising? Beyond, say, r = 1?