Spam stats

Hormel Foods, the Minnesota meatpacker, reports a surge in sales of Spam. News accounts attribute the rising popularity of the pink meat-in-a-can to higher prices for other commodities. Or maybe it’s the Spam musubi fad.

Meanwhile, the other kind of spam seems to be surging as well. I’ve been keeping track of my personal spam consumption for the past five years. (I first wrote about this in 2003, with a follow-up in 2007.) Here’s a record of the total number of messages landing in my spam bin each month since the start of 2007:

spamvolume.png

The lull last spring gave me some hope that spam was finally in decline; the monthly intake even fell below 1,000 messages in March and April. But the respite didn’t last. There was steady growth through last summer and fall, and now another spike in volume has brought the rate to nearly 3,000 messages per month.

The message counts charted above lump together spam sent to several email addresses. Here’s a breakdown by address, covering the entire 17-month period:

mailboxes.png

The two addresses that attract the most unwanted traffic—namely, my address here at bit-player.org and another at amsci.org—are both published openly on the web, without any form of obfuscation. So are the addresses identified in the pie chart as “il-perms” and “il-prints”; they appear on my industrial-landscape.org web site. I’m certainly not surprised that spammers have discovered these addresses; they are fair game to anyone who knows how to scrape a web site. But there are still some puzzles in the data. I have several more email addresses that are equally vulnerable—they are published in the same places—but they receive nary a spam. Why not? And my earthlink.net and acm.org addresses are not published (or even much used), yet they get a healthy share of junk mail.

The content of the spam remains much the same—replica watches, blue pills, pirate software, phishing expeditions. Numbingly repetitious. In one week I got 25 messages with the same subject line: “eBay New Unpaid Item Message from snorelax67.” Then there were the 34 messages with subject lines such as “Viadzgra - $1.20,” “Viabqgra - $1.75,” “Viafmgra - $1.09” and “Viategra - $1.38.” (Evidently someone has written a little program to insert random letter pairs in the middle of the word. My spam filter was not fooled. Nor did it fall for “Hihg - qualiyt repliacs of the ebst lcock of the wrold!!”) In “How Many Ways Can You Spell V1@gra?” I argued that most of the world’s spam is coming from a relatively small number of senders—tens or possibly hundreds, but not thousands—and I think the evidence continues to support that conjecture.
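For illustration only, here is one way a filter could see through the random-letter trick: a pattern that tolerates up to two stray letters in the middle of the word. This is a minimal sketch in Python, not a description of how the spam filter mentioned above actually works. The pattern and the sample list are assumptions; the four spam subject lines are the ones quoted above, and the last entry is an invented legitimate message.

```python
import re

# Assumed pattern: "via", then zero to two arbitrary letters, then "gra".
OBFUSCATED = re.compile(r"\bvia[a-z]{0,2}gra\b", re.IGNORECASE)

subjects = [
    "Viadzgra - $1.20",
    "Viabqgra - $1.75",
    "Viafmgra - $1.09",
    "Viategra - $1.38",
    "Lunch on Friday?",  # invented non-spam example
]

# Keep only the subject lines that match the obfuscation pattern.
flagged = [s for s in subjects if OBFUSCATED.search(s)]
print(len(flagged), "of", len(subjects), "subjects flagged")
```

A fixed pattern like this catches only one product name and one style of mangling, which is why it would be a poor substitute for a statistical filter.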

One interesting trend in my spam is that it seems to be growing more cosmopolitan. Back in 2003, about 18 percent of the spam I received was written in languages other than English; the figure now is 34 percent. The distribution of languages is curious. Here are the data for May 2008, when I received a total of 933 non-English spams:

spamlangs.png

Does everybody get gobs of spam in Russian, or is it just me? Is there something about my Internet activity that leads mailing-list compilers to believe I read Russian? Well, here’s the sad truth: My knowledge of Russian is so totally lacking that I’m not even sure all those messages are really Russian. They come with a Cyrillic character encoding, but for all I know some of them could be Bulgarian or Ukrainian. I’m equally in the dark about the 153 messages that appear to be written in various Asian languages (Chinese, Japanese, Korean). As for the German messages, they are something of a novelty. Until a few weeks ago, I almost never saw spam in German, and now there’s a sudden spate. It’s pretty clear that all of it comes from the same source. I’m seeing no French spam, nor Portuguese, nor Hindi, Urdu, Arabic, Hebrew.
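A rough way to sort subject lines into these broad groups without reading them is to look at the Unicode names of the letters. The sketch below is a hypothetical helper, not the method used for the chart above; the function name `dominant_script` and the grouping into "Cyrillic", "Asian" and "Latin" are assumptions made for illustration.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most common script among the letters in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["Cyrillic"] += 1
        elif name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL")):
            counts["Asian"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
        else:
            counts["other"] += 1
    return counts.most_common(1)[0][0] if counts else None

print(dominant_script("программы"))    # Cyrillic
print(dominant_script("こんにちは"))     # Asian
print(dominant_script("Hello world"))  # Latin
```

Note that this identifies a script, not a language. It cannot tell Russian from Bulgarian or Ukrainian, nor German from Italian, so it leaves the uncertainty described above exactly where it was.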

Linguistic diversity is laudable, and in general I’m pleased to see challenges to Anglophone hegemony. I’m always flattered when someone addresses me in another language—even if I can’t respond in kind. But in this case I’m afraid there’s no reason to be congratulating myself. The spammers are not sending me these multilingual documents because they take me for an accomplished and urbane polyglot. They’re sending them to me (and to millions of others) because selectivity just isn’t worth the bother. Addressees like you and me are too cheap to count. Spam is becoming something like the cosmic microwave background radiation. It’s everywhere, it’s meaningless, it can be mistaken for birdshit.

Update 2008-07-01. More pink meat. I’ve tallied up the receipts for June, and my personal spam volume has set a new record: 3,354 messages, an increase of 20 percent over the previous high of 2,794 in May. The updated graph now covers 18 months:

spamvolume701.png

It’s worrisome to see the quantity growing so fast, but let me try to put the matter in perspective. Alongside the 3,354 spams I received in June, I also received 1,245 nonspam messages. Thus the proportion of spam is about 73 percent—well under the figure of 90 percent that’s often bandied about by companies that sell anti-spam products and services. Moreover, the spam causes me very little actual bother; almost all of it goes directly into the junk folder without need for human intervention. The nonspam messages, on the other hand, demand to be read and responded to. Perhaps I’d get more accomplished if more of my mail were spam.
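For anyone who wants to check the arithmetic, here is a short Python sketch using only the counts quoted above; the variable names are mine.

```python
spam_may = 2794       # previous record, May
spam_june = 3354      # new record, June
nonspam_june = 1245   # legitimate messages received in June

# Month-over-month growth in spam volume.
increase = (spam_june - spam_may) / spam_may

# Spam as a fraction of all mail received in June.
share = spam_june / (spam_june + nonspam_june)

print(f"increase over May: {increase:.1%}")   # 20.0%
print(f"spam share in June: {share:.1%}")     # 72.9%
```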

I have not done a language analysis of the new batch, but I can tell at a glance that I’m still attracting a bizarre glut of Russian spam. A subject line that caught my eye reads:

programspam.png

I can sound out just enough Russian to guess the transliteration “programme spam.” Inside the message is an image of an advertisement (also in Russian) for various warez. But the decoy text that’s meant to get the message through the spam filters is a sports story written in German. Thus even individual messages are now becoming multilingual.

Update 2008-09-01: When I started this thread back in the spring, I thought I was taking note of a step function in the spam rate—a sudden jump from 2,000 a month to a new plateau at 2,500 a month. The trend looks different now: not a series of steps but sustained steady growth, with an increment of roughly 500 a month:

spamvolume901.png

Total spams received in my various inboxes came to 3,886 for July and 4,489 for August.
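The "roughly 500 a month" figure can be checked against the monthly totals reported in this post. The sketch below uses only those four numbers (May through August); it is a simple difference calculation, not a fitted trend line.

```python
# Monthly spam totals quoted in the post.
monthly = [("May", 2794), ("June", 3354), ("July", 3886), ("August", 4489)]

counts = [n for _, n in monthly]

# Difference between each month and the one before it.
increments = [later - earlier for earlier, later in zip(counts, counts[1:])]
mean_increment = sum(increments) / len(increments)

print(increments)       # [560, 532, 603]
print(mean_increment)   # 565.0
```

Three increments are far too few to distinguish steady linear growth from anything else, so this only confirms that the recent months are consistent with the description.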

And I continue to be amazed and baffled by the quantity of Russian-language spam. The proportion of my spam written in a Cyrillic alphabet is now above 40 percent. The growth in Russian-language messages accounts for about two-thirds of the overall increase in the past few months. Should I read some geopolitical meaning into this trend?

This entry was posted in modern life, statistics.

6 Responses to Spam stats

  1. Barry Cipra says:

    Just curious: How did you choose the ordering of the wedges in your pie charts? In particular, why place German between two romance languages?

  2. brian says:

    I was hoping you would ask. It’s such a fascinating story!

    I counted those emails by hand, scanning 2,715 subject lines in chronological order. The first non-English message I came to was in Russian, so I listed that category first. Next came a message in an Asian language. Do you begin to see the pattern? The first German email happened to arrive after the first Italian one but before the first Spanish one. (The “unknown” category is something I split off from the Asian group after the fact; I’m pretty sure those messages are in some form of Japanese.)

    By the way, I never want to do this kind of counting again. I had tried to automate the language labelling by extracting strings such as “charset=iso-2022-jp” (which designates a Japanese character encoding). The results were totally bogus. Most of the messages have no such encoding designators, and many of them have misleading markers, or malformed ones.

    Next time — if there is a next time — I’m going to write a little n-gram language recognizer. Or else figure out how to abuse the Google language tools and have them do the job for me.

  3. Jim Ward says:

    Re: The “mistaken for birdshit” link, I was listening to Feynman’s “What Do You Care What Other People Think?” yesterday and I was interested to learn that in his day physics was divided between experimentalists and theorists. The experimentalists would discover some odd fact, and then go to the theorists for an explanation. It looks like June 1963 was the exact point when the experimentalists and theorists caught up with each other. Today the theorists have lapped the experimentalists, which explains the slow progress of physics today. Too much theory, not enough data.

    I also learned of tang and clevis joints, in case you want to write “Physics in the Bedroom”. De Sade has beaten you to “Philosophy”.

  4. Jim Ward says:

    Speaking of science and sex, Mary Roach has a new book out, “Bonk”. Her previous books, “Stiff” and “Spook” were pretty good. I only have one recommendation for a group theory book …