Spam by the numbers

Reviewing this month’s batch of incoming junk mail, I stumbled upon the following message:

numberspam440.png

In case that image is too tiny to read, here is the first word in source-code form:

     28    47   34
     74    33
      85  42 
      16  43    25    5048     08124   8813    2714             
      34  02    25       66   50  31   855        05  
       3404     65    88362   00  25   72      01651    
       8008     36   42  77   27  81   06     04  40
        72      83   02  32   47  12   24     87  33  
        78      03    87100    83844   18      21813   
                                  08
                              73634 

The basic technique is anything but novel. I can remember green-and-white-striped printouts that had my name emblazoned in the same kind of two-inch-high characters. But why are the characters here formed entirely out of numbers, rather than other ASCII glyphs? And do the numbers themselves mean anything?

I think I know the answer to the first question: The spammer thought a message composed of nothing but numerals might slip through the spam filters. (In my case, at least, it didn’t work. I fished this message out of the garbage pail.)

As for the second question, my immediate guess was that the digits are the output of some simple pseudo-random number generator. That would be an easy way to produce them, and it would also allow the spammer to make each individual message unique. On taking a closer look, however, I realized there was something quite nonrandom about the numbers in the message.

Here is the full list of digits. There are exactly 900 of them. Do you see what’s missing?

284734807433341016202332628542642574418481303116432550480812488
132714721846667434022566503185505580464271163634046588362002572
016511712427000046735580083642772781060440148383627872830232471
224873301464000807803871008384418218130077346262602008225346571
155727363470732323181618223162744253246331737038301533254837881
148802160371074555632302255640217448457046416116253484658726108
147181540061231788804563557807254177278106044014838362787283023
247122487330146400080780387100838462042135220046847482422143746
770236783058460185444521283134537306537546855305024142275437615
010235002438258320577785451436776143066166025853832747551576004
831136831376228235381112678466011047530048032816623514158481030
413446024450055236762111281250031205166204213522004684748242214
374677023678305846018544452128313453730653754685530502414227543
761501023500243825832057778545143677614306616602585383274755157
600483113683137622

There’s nary a 9 in the bunch. And in other respects too the digit distribution looks slightly off-kilter:

digitdist.png
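
For anyone who wants to repeat the tally, here is a minimal Python sketch of the counting step. It is not necessarily how the chart above was made. The helper name digit_counts is hypothetical, and the sample string is only the first row of the listing above, so the counts cover 63 digits rather than all 900.

```python
from collections import Counter
import string

def digit_counts(text):
    """Return a list of 10 counts, one per digit 0-9, ignoring other characters."""
    tally = Counter(ch for ch in text if ch in string.digits)
    return [tally.get(str(d), 0) for d in range(10)]

# First row of the digit listing, used here only as a small sample.
sample = "284734807433341016202332628542642574418481303116432550480812488"

counts = digit_counts(sample)
for digit, n in enumerate(counts):
    # Print a crude text histogram: the digit, its count, and a bar of that length.
    print(digit, f"{n:3d}", "#" * n)
```

Pasting the whole listing into sample (line breaks are ignored) would give the tally for the full message.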

When I tabulated how often each digit is followed by each other digit, that too looked a little fishy, although the sample is too small for any reliable conclusions.

                   s e c o n d
           0  1  2  3  4  5  6  7  8  9
      0   23 12 20  9 17  9  7  7  8  0
      1   11 13 11 12 16  8 13  5 10  0
   f  2   11 11 13 15 14 15  6 14  9  0
   i  3   18 13 15  7 11  8 13 13 12  0
   r  4    8  9 12 13 12 10 22 11 18  0
   s  5   11  7  5 14 12 14  4 10 11  0
   t  6   12 10 15  6  8  7 10 10  6  0
      7    6 10  9 10 12  7  9 11 14  0
      8   12 14  8 24 13 10  0  7  6  0
      9    0  0  0  0  0  0  0  0  0  0
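
A table of this kind can be built by walking through the digits and counting each adjacent pair. The following Python sketch shows one way to do it, under the assumption that the digits are held in a single string. The name pair_counts is hypothetical, and the sample is again just the first row of the listing, so it yields 62 pairs rather than the full set.

```python
def pair_counts(digits):
    """Count adjacent digit pairs; table[a][b] is how often b follows a."""
    table = [[0] * 10 for _ in range(10)]
    for first, second in zip(digits, digits[1:]):
        table[int(first)][int(second)] += 1
    return table

# First row of the digit listing, used here only as a small sample.
sample = "284734807433341016202332628542642574418481303116432550480812488"

table = pair_counts(sample)

# Print in the same layout as the table above: rows are the first digit,
# columns are the second digit.
print("    " + " ".join(f"{d:2d}" for d in range(10)))
for first, row in enumerate(table):
    print(f"{first:2d}  " + " ".join(f"{n:2d}" for n in row))
```

A string of n digits contains n - 1 adjacent pairs, so the entries should always sum to one less than the number of digits.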

So what’s going on here? I think the pseudo-random generator is still a leading candidate, though it would have to be a badly implemented RNG. The absence of 9s isn’t hard to explain: We only have to suppose that the spammer was working in C and wrote the plausible-looking expression random(9), thinking that would generate integers from 0 through 9, when a routine of that kind typically excludes its upper bound and returns only 0 through 8.
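
The suspected off-by-one slip is easy to reproduce. The sketch below is written in Python rather than C, and it is only an illustration: the spammer's actual code is unknown. Python's random.randrange(9) stands in for the hypothetical random(9) call, since it likewise excludes its upper bound. The helper name draw_digits and the fixed seed are assumptions made for the example.

```python
import random

def draw_digits(n, upper, seed=0):
    """Draw n integers with randrange(upper), which never returns upper itself."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.randrange(upper) for _ in range(n)]

buggy = draw_digits(900, 9)    # the suspected mistake: values 0 through 8 only
fixed = draw_digits(900, 10)   # what was presumably intended: values 0 through 9

print("largest digit with bound 9: ", max(buggy))
print("largest digit with bound 10:", max(fixed))
print("number of 9s with bound 9:  ", buggy.count(9))
```

This accounts only for the missing 9s. It says nothing about the unevenness among the other digits.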

On the other hand, maybe it isn’t random. Maybe there’s a secret message-within-the-message. Anybody see a pattern?

While I’m talking spam, I’ll update my ongoing tally of my inbox contents. I can report that September was a good, strong month for spam, with further steady growth continuing the summer-long trend. The stock market is in retreat and credit is tight, but the purveyors of replica watches are undeterred. My receipts have crossed the 5,000-per-month threshold for the first time:

spamcounts.png

And another threshold has also been left behind: For the first time this month, more than half of my spam is written in Russian. (Based on character-set declarations, 2,858 messages out of 5,021 were in Cyrillic scripts, or about 57 percent.)

Update 2008-10-12: In response to a request in the comments, I’ve uploaded the full text (including headers) of the original email. The file is here. Incidentally, I’ve searched my spam archive for other messages like this one, without success. That in itself makes this a peculiar spam. Usually, if I get a spam once, I see dozens of copies or variants within a few days.
