A new flavor of spam

I wrote about spam three years ago in the May-June 2003 issue of American Scientist. Since then, my unhealthy fascination with the stuff has been dulled somewhat by overexposure, but I continue to keep an eye on the subject. The volume of spam reaching my mailbox is now about 10 megabytes per month, almost all of it silently skimmed off into a “Junk” folder by the simple filter built into the Eudora software that I use for reading my e-mail. Much of the spam includes countermeasures intended to fool such filters. For example, many messages have gobs of random text, generated by plucking words or phrases from some arbitrary source. Here’s a sample:

He laughed as he ran his hand through the clustering curls of his hair and said us a long time therell be a pretty kettle of fish In pursuance of my aunts kind scheme I was shortly afterwards and jungle fevers and agues and every kind of thing you can mention As to his liverand sorry and so reliant upon me to be so too that nothing she could have said would have

As it happens, I have also written about algorithms for producing this sort of nonsense (“Computer Recreations: A progress report on the fine art of turning literature into drivel,” Scientific American, November 1983, pages 18-28). It’s not impossible that the spammers learned how to create this blancmange from me. So perhaps the 10 megabytes per month that now comes my way is only fitting retribution.

The other morning, though, I received some spam with an unusual flavor. The “payload” — the part the spammer wanted me to read — is perfectly ordinary and not worth reproducing, but here is the block of filter-foiling random text (slightly reformatted for presentation here in HTML):

John was enjoying sleeping near the tree..
That dentist is not enjoying writing near my home..
Luke is missing jumping today..
That carpenter is practicing running at this time..
Weren’t those plumbers practicing talking last month?.
But this is where you come in: Between now and November, you, the American people, you can reject the tired, old, hateful, negative politics of the past. And instead you can embrace the politics of hope, the politics of what’s possible because this is America, where everything is possible..
I don’t miss jumping for three or four weeks..
That computer programmer isn’t enjoying swimming behind the post office right at this time.
I was missing jumping..
I don’t miss jumping for three or four weeks..
Is the scientist missing praying?.
I was missing jumping..
3.
Is the manager missing walking?.
Joe’s girlfriend generally misses laughing..
Did Alfred’s niece like playing well?.

This is not mere random text; it is random grammatical text. That is, the method for creating it is not to choose words at random but to generate sentences at random with a generative grammar of English. And the grammatical structures are nontrivial; those participles governing gerunds (“I was missing jumping”) are the sort of thing that gives a computational linguist fits. The proper handling of such embedded sentences was one of the problems that led Noam Chomsky and his colleagues to abandon simple phrase-structure grammars and try transformational grammars. Someone has been working hard to get this right. They say that pornography has inspired and sustained many innovations in both technology and business practices; maybe spam will have similar effects on the little academic field of computational linguistics.
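To make the idea of generating sentences from a grammar concrete, here is a minimal sketch in Python. It is not the spammers' method, which I can only guess at; it uses a plain phrase-structure (context-free) grammar, the simpler kind of grammar mentioned above, and the rules, the vocabulary and the function names are all invented for illustration.

```python
import random

# Toy grammar, invented for illustration. Each nonterminal maps to a list of
# alternative expansions; any symbol that is not a key is a terminal word.
GRAMMAR = {
    "S": [["NP", "AUX", "VERB_ING", "GERUND", "ADJUNCT"]],
    "NP": [["John"], ["Luke"], ["that", "NOUN"], ["the", "NOUN"]],
    "NOUN": [["dentist"], ["carpenter"], ["manager"], ["scientist"]],
    "AUX": [["is"], ["was"], ["is", "not"]],
    "VERB_ING": [["enjoying"], ["missing"], ["practicing"]],
    "GERUND": [["jumping"], ["running"], ["sleeping"], ["writing"]],
    "ADJUNCT": [[], ["today"], ["near", "the", "tree"], ["at", "this", "time"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit the word itself
    production = rng.choice(GRAMMAR[symbol])  # pick one alternative at random
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words


def random_sentence(rng):
    """Generate one sentence, capitalized, with a single trailing period."""
    text = " ".join(expand("S", rng))
    return text[0].upper() + text[1:] + "."


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print(random_sentence(rng))
```

Because the word choices are random but the sentence skeleton is fixed by the rules, every output is grammatical in the same limited way the spam sample is: a subject, a progressive verb, and a gerund, optionally followed by a place or time phrase.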

How interesting that with all the work the spammers have invested in grammar, they couldn’t get the punctuation right. Nearly every sentence has two trailing marks.

Note that the paragraph in the middle (“But this is where you come in….”) is obviously something else entirely. Google reveals that it is lifted from John Edwards’s speech at the 2004 Democratic National Convention. A Google search for some of the other phrases in this message shows that I’m definitely not the first or the only one to notice the new style in spam. “Is the manager missing walking?” turns up in several collections of spam poetry.

Why do spam “authors” include this kind of glop in their messages? I think it can be understood as part of an arms race between spammers and filterers. When spamming began (and before filtering began), the spammer could send out millions of copies of the same message. But it’s easy to block such mass-produced prose; the filter can simply search for the specific, known text. Adding random gibberish lets the spammer send out millions of messages with no two alike. But then the filterer escalates: Genuine language can be distinguished from random text by a statistical analysis; a string of words such as “of his hair and said us a long time therell be” is highly unlikely in real speech. With the advent of random grammars I think we may be seeing the next round in this conflict.
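The statistical test described above can be sketched in a few lines of Python. This is a deliberately crude illustration, not how any particular filter works: it merely counts what fraction of adjacent word pairs in a message also appear in a reference text. A real filter would presumably use a large corpus and smoothed probabilities. The reference text here is the Edwards passage quoted earlier, and the function names are hypothetical.

```python
import re

# Tiny stand-in for a corpus of genuine English: the passage quoted in the
# spam sample above. A real filter would need vastly more text.
REFERENCE_TEXT = (
    "But this is where you come in: Between now and November, you, the "
    "American people, you can reject the tired, old, hateful, negative "
    "politics of the past. And instead you can embrace the politics of hope, "
    "the politics of what's possible because this is America, where "
    "everything is possible."
)


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def bigrams(tokens):
    """Return the list of adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))


def seen_bigram_fraction(text, known):
    """Fraction of the text's word pairs that occur in the known set."""
    pairs = bigrams(tokenize(text))
    if not pairs:
        return 0.0  # fewer than two words: nothing to judge
    return sum(pair in known for pair in pairs) / len(pairs)


KNOWN = set(bigrams(tokenize(REFERENCE_TEXT)))

if __name__ == "__main__":
    genuine = "you can embrace the politics of hope"
    scrambled = "hope of politics the embrace can you"
    print(seen_bigram_fraction(genuine, KNOWN))    # every pair is familiar
    print(seen_bigram_fraction(scrambled, KNOWN))  # no pair is familiar
```

The same seven words score at opposite extremes depending on their order, which is the whole point: word-salad filler gives itself away through unlikely neighbors, whereas sentences built from a grammar keep their words in plausible sequence.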


One Response to A new flavor of spam

  1. Charles Blackburn says:

    The random text in spam may not be random at all. It’s oddly reminiscent of the coded signals broadcast by radio from England to the French Resistance during World War II. For example, “John has a long moustache” was the code alert for the D-Day landings on the beaches of Normandy. Maybe these modern-day versions are coded messages from the Office of Homeland Security?