Spam by the numbers

Reviewing this month’s batch of incoming junk mail, I stumbled upon the following message:

[Image: numberspam440.png, the spam message]

In case that image is too tiny to read, here is the first word in source-code form:

     28    47   34
     74    33
      85  42 
      16  43    25    5048     08124   8813    2714             
      34  02    25       66   50  31   855        05  
       3404     65    88362   00  25   72      01651    
       8008     36   42  77   27  81   06     04  40
        72      83   02  32   47  12   24     87  33  
        78      03    87100    83844   18      21813   
                                  08
                              73634 

The basic technique is anything but novel. I can remember green-and-white-striped printouts that had my name emblazoned in the same kind of two-inch-high characters. But why are the characters here formed entirely out of numbers, rather than other ASCII glyphs? And do the numbers themselves mean anything?

I think I know the answer to the first question: The spammer thought a message composed of nothing but numerals might slip through the spam filters. (In my case, at least, it didn’t work. I fished this message out of the garbage pail.)

As for the second question, my immediate guess was that the digits are the output of some simple pseudo-random number generator. That would be an easy way to produce them, and it would also allow the spammer to make each individual message unique. On taking a closer look, however, I realized there was something quite nonrandom about the numbers in the message.

Here is the full list of digits. There are exactly 900 of them. Do you see what’s missing?

284734807433341016202332628542642574418481303116432550480812488
132714721846667434022566503185505580464271163634046588362002572
016511712427000046735580083642772781060440148383627872830232471
224873301464000807803871008384418218130077346262602008225346571
155727363470732323181618223162744253246331737038301533254837881
148802160371074555632302255640217448457046416116253484658726108
147181540061231788804563557807254177278106044014838362787283023
247122487330146400080780387100838462042135220046847482422143746
770236783058460185444521283134537306537546855305024142275437615
010235002438258320577785451436776143066166025853832747551576004
831136831376228235381112678466011047530048032816623514158481030
413446024450055236762111281250031205166204213522004684748242214
374677023678305846018544452128313453730653754685530502414227543
761501023500243825832057778545143677614306616602585383274755157
600483113683137622

There’s nary a 9 in the bunch. And in other respects too the digit distribution looks slightly off-kilter:

[Figure: digitdist.png, histogram of digit frequencies]

When I tabulated all the correlations between successive digits, that too looked a little fishy, although the sample is too small for any reliable conclusions.

                   s e c o n d
           0  1  2  3  4  5  6  7  8  9
      0   23 12 20  9 17  9  7  7  8  0
      1   11 13 11 12 16  8 13  5 10  0
   f  2   11 11 13 15 14 15  6 14  9  0
   i  3   18 13 15  7 11  8 13 13 12  0
   r  4    8  9 12 13 12 10 22 11 18  0
   s  5   11  7  5 14 12 14  4 10 11  0
   t  6   12 10 15  6  8  7 10 10  6  0
      7    6 10  9 10 12  7  9 11 14  0
      8   12 14  8 24 13 10  0  7  6  0
      9    0  0  0  0  0  0  0  0  0  0
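
For anyone who wants to redo the tabulation, here is a minimal sketch of how the digit counts and the successive-digit table could be computed. The function names are my own invention for illustration, and the sample string is only the first 63 digits of the list above, not the full 900.

```python
from collections import Counter

def digit_counts(digits: str) -> list[int]:
    """Return how many times each digit 0-9 occurs in the string."""
    counts = Counter(digits)
    return [counts.get(str(d), 0) for d in range(10)]

def pair_counts(digits: str) -> list[list[int]]:
    """Return a 10x10 table where table[a][b] counts digit a immediately followed by b."""
    table = [[0] * 10 for _ in range(10)]
    for first, second in zip(digits, digits[1:]):
        table[int(first)][int(second)] += 1
    return table

# First line of the digit list from the post (63 digits).
sample = "284734807433341016202332628542642574418481303116432550480812488"
print(digit_counts(sample))
for row in pair_counts(sample):
    print(row)
```

A string of n digits yields n - 1 successive pairs, so the entries of the pair table should always sum to one less than the total digit count.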

So what’s going on here? I think the pseudo-random generator is still a leading candidate, though it would have to be a badly implemented RNG. The absence of 9s isn’t hard to explain: We only have to suppose that the spammer was working in C and wrote the plausible-looking expression random(9), thinking that would generate integers between 0 and 9, when a call of that form typically returns values from 0 through 8.
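
The suspected off-by-one slip is easy to reproduce. The sketch below uses Python rather than C, purely as an illustration; I have no idea what language or library the spammer actually used, and the function names are hypothetical. In Python's standard library, random.randrange(9) has an exclusive upper bound, while random.randint(0, 9) includes both endpoints.

```python
import random

def exclusive_bound_digits(n: int, seed: int = 0) -> str:
    """Draw n digits with randrange(9): the bound is exclusive, so 9 never appears."""
    rng = random.Random(seed)
    return "".join(str(rng.randrange(9)) for _ in range(n))

def inclusive_bound_digits(n: int, seed: int = 0) -> str:
    """Draw n digits with randint(0, 9): both endpoints are possible."""
    rng = random.Random(seed)
    return "".join(str(rng.randint(0, 9)) for _ in range(n))

buggy = exclusive_bound_digits(900)
print("9" in buggy)  # the exclusive bound rules out the digit 9
```

A generator with this bug would explain the missing 9s, but not the repetitions that turn up in the comments.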

On the other hand, maybe it isn’t random. Maybe there’s a secret message-within-the-message. Anybody see a pattern?

While I’m talking spam, I’ll update my ongoing tally of my inbox contents. I can report that September was a good, strong month for spam, with further steady growth continuing the summer-long trend. The stock market is in retreat and credit is tight, but the purveyors of replica watches are undeterred. My receipts have crossed the 5,000-per-month threshold for the first time:

[Figure: spamcounts.png, monthly spam counts]

And another threshold has also been left behind: For the first time this month, more than half of my spam is written in Russian. (Based on character-set declarations, 2,858 messages out of 5,021 were in Cyrillic scripts, or about 57 percent.)

Update 2008-10-12: In response to a request in the comments, I’ve uploaded the full text (including headers) of the original email. The file is here. Incidentally, I’ve searched my spam archive for other messages like this one, without success. That in itself makes this a peculiar spam. Usually, if I get a spam once, I see dozens of copies or variants within a few days.


12 Responses to Spam by the numbers

  1. rouli says:

    Well, I see a pattern, but I don’t know what it means.
    Every now and then, you see that the same digit repeats consecutively three times. The first occurrence is 333 after the first ten digits. This “phenomenon” occurs roughly every 64 digits:

    2847348074
    3334101620233262854264257441848130311643255048081248813271472184
    6667434022566503185505580464271163634046588362002572016511712427
    000046735580083642772781060440148383627872830232471224873301464
    0008078038710083844182181300773462626020082253465711557273634707
    32323181618223162744253246331737038301533254837881148802160371074
    5556323022556402174484570464161162534846587261081471815400612317
    888045635578072541772781060440148383627872830232471224873301464
    00080780387100838462042135220046847482422143746770236783058460185
    4445212831345373065375468553050241422754376150102350024382583205
    7778545143677614306616602585383274755157600483113683137622823538
    1112678466011047530048032816623514158481030413446024450055236762
    1112812500312051662042135220046847482422143746770236783058460185
    4445212831345373065375468553050241422754376150102350024382583205
    7778545143677614306616602585383274755157600483113683137622

  2. Jeff says:

    Maybe this is the spam version of a numbers station (http://en.wikipedia.org/wiki/Numbers_station), a way for spy agencies to send coded messages out to the field. Spam, like shortwave signals, is part of the ether and can be received anonymously. As an added bonus, the spy agency could raise funds by selling pharmaceuticals to silly people!

  3. Brent Castle says:

    Here is a plot of the ordered pairs. The x axis is the index and the y axis is the pair. Some patterns are evident. I can’t imagine that they are an artifact of the rng, but it’s possible.

    http://www.cs.indiana.edu/~bscastle/t2.pdf

    One particular pattern
    http://www.cs.indiana.edu/~bscastle/t2-zoom.pdf

    This zoom highlighted a huge copied section. Maybe it was an rng + copy & paste!

    284734807433341016202332628542642574418481303116432550480812488
    132714721846667434022566503185505580464271163634046588362002572
    016511712427000046735580083642772781060440148383627872830232471
    224873301464000807803871008384418218130077346262602008225346571
    155727363470732323181618223162744253246331737038301533254837881
    148802160371074555632302255640217448457046416116253484658726108
    147181540061231788804563557807254177278106044014838362787283023
    2471224873301464000807803871008384

    620421352200468474824221437467702367830584601854445212831345373
    065375468553050241422754376150102350024382583205777854514367761
    4306616602585383274755157600483113683137622

    823538111267846601104753004803281662351415848103041344602445005
    523676211128125003120516

    620421352200468474824221437467702367830584601854445212831345373
    065375468553050241422754376150102350024382583205777854514367761
    4306616602585383274755157600483113683137622
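
A copied section like this can also be found mechanically, without a plot. What follows is a minimal sketch, not the method used above: a plain dynamic-programming search for the longest substring that starts at two different positions. The function name is made up for illustration, and the quadratic running time is only reasonable because the input is a few hundred digits long.

```python
def longest_repeat(s: str) -> tuple[int, int, int]:
    """Return (length, first_start, second_start) of the longest substring
    that occurs at two different start positions (occurrences may overlap)."""
    n = len(s)
    best = (0, 0, 0)
    prev = [0] * (n + 1)  # prev[j]: common-suffix length of s[:i-1] and s[:j]
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            if s[i - 1] == s[j - 1]:
                # Extend the match that ended one character earlier.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

# Toy input: a ten-digit run, two filler digits, then the same run again.
print(longest_repeat("6204213522" + "81" + "6204213522"))
```

Fed the full 900-digit string, a search of this kind should report the duplicated block and the two offsets where it begins.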

  4. rouli says:

    how could I miss the copy&paste?
    hmm, another pattern - look for example at the beginning of the following two lines
    5556323022556402174484570464161162534846587261081471815400612317
    888045635578072541772781060440148383627872830232471224873301464

    note that many times, to go from a digit in the first line (x), to the digit right below it (y) you would do y=(x+3) mod 9 (remember that 9 is not playing). For example 5->8, 6->0, 2->5 and so on.

    If that’s an RNG, it’s broken :)

  5. John Cowan says:

    Check out FIGlet if you are nostalgic for ASCII art. It’s free, there are lots of fonts (though not a randomish-number one), and it even does Unicode.

  6. Iain says:

    I’m not sure why you think the histogram that you’ve plotted looks “off-kilter”. It looks entirely typical of 900 uniform draws from the integers 0 to 8. Remember the error bars are going to be roughly sqrt(size of bar).
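
A rough sketch of that check, for anyone who wants to try it: with 900 draws over nine digits the expected count per digit is 100, so the square-root rule gives error bars of about 10. The code below computes those error bars and a chi-square statistic against a uniform distribution. The counts in the example are invented for illustration, not read off the histogram, and the function names are hypothetical.

```python
import math

def sqrt_error_bars(counts: list[int]) -> list[float]:
    """Approximate one-sigma error bar for each bin as sqrt(count)."""
    return [math.sqrt(c) for c in counts]

def chi_square_uniform(counts: list[int]) -> float:
    """Chi-square statistic of the observed counts against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Made-up counts for digits 0 through 8, summing to 900.
example = [112, 99, 108, 110, 115, 88, 84, 90, 94]
print(sqrt_error_bars(example))
print(chi_square_uniform(example))
```

With nine bins there are eight degrees of freedom, for which the 5 percent critical value is roughly 15.5; a statistic below that gives no grounds for rejecting uniformity.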

  7. brian says:

    @ rouli and Brent Castle: Truly bizarre! I certainly didn’t see those patterns, I didn’t expect them, and I have no idea what (if anything) they might mean.

    @ Iain: What caught my eye was not so much the variance in the digit counts as the suggestion of a systematic bias in favor of smaller numbers, or against larger ones. But you’re right. The null hypothesis is certainly not excluded.

  8. Mike Kenyon says:

    This reminds me of the Arizona Lottery’s “Pick 3” incident in 1998. In the Pick 3 game, players select a number from 000 to 999. But a programming error made it impossible for the digit 9 ever to be part of a winning number. See, for example, “The Case of the Missing Lottery Number” by Bill Kaigh in the January 2001 issue of the College Mathematics Journal.

  9. Gerry says:

    Since the digit 9 is not being used, maybe this is actually some number in base 9. Perhaps if you convert it to base 10 you’ll recognize it as the expansion of pi or square root 2 or some other well-known number.

  10. Jim Ward says:

    In the Pick 3 incident about 27% of the numbers had no chance of winning. I guess you couldn’t make money by buying all of the remaining numbers, but it sure would improve your odds!

    http://www.jstor.org/sici?sici=0746-8342(200101)32:1%3C15:TCOTML%3E2.0.CO;2-J

  11. brian says:

    @Gerry: Interpreted as a base-9 integer and converted to base-10:

    21557800711710455255538736971189067184979311434258865
    78308480799809889243093913713456201939041957299038496
    92522829906262673817030272601845718722434818441532642
    01186145590928387458644335520433398067558904370544869
    11763390110513489405480553393465278460216602842071182
    94392819007365124083927531356467573730341367712464596
    73578972318488741099211613469405646022394034975370695
    44070083245988659360516848783052576414226520984015498
    00910118635540135328878932494424012743673769177999175
    93166457665919619691025965877917697701459299617821176
    22721203343916785719267300656350767183003001234021350
    68030470193474610044996388880003812710061908728188339
    09815344925956620857123514756375124734531814289772807
    43939311204666991296542066037159080414685309686457513
    27221078741358518055477983355859939062169804166807729
    47288413808548687516349613282353013379229827161902064
    95365179005

    I’m not seeing pi or root 2, but who knows?
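
A conversion like this needs only one call in a language with arbitrary-precision integers. Here is a minimal sketch in Python, which may or may not be how the number above was produced; the function name is hypothetical, and the example input is just the first 18 digits of the spam.

```python
def base9_to_decimal(digits: str) -> int:
    """Interpret a string of digits 0-8 as a base-9 integer."""
    # int() raises ValueError if the string contains a 9 or any non-digit.
    return int(digits, 9)

print(base9_to_decimal("284734807433341016"))
```

Because int() rejects any digit that is out of range for the base, the same call doubles as a check that the input really contains no 9s.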

  12. rouli says:

    and of course, even in base 9, pi should not have a repeated pattern within it.
    Let the n’th digit be d(n).
    If you make a histogram of d(n)-d(n-64) (mod 9) you’ll find that 3 is by far the most probable outcome, and there are no occurrences of 7 or 8 (if I recall correctly).

    Any chance of uploading the original spam mail?
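
The histogram of d(n)-d(n-64) (mod 9) described in this comment is simple to reproduce. The sketch below is one way to do it; the function name and its parameters are made up for illustration. The toy input uses a lag of 4 instead of 64, and it is built so that each digit is the one four places earlier plus 3, mod 9.

```python
from collections import Counter

def lag_difference_histogram(digits: str, lag: int = 64, modulus: int = 9) -> Counter:
    """Count the values of (d(n) - d(n - lag)) mod modulus over the whole string."""
    values = [int(ch) for ch in digits]
    return Counter(
        (values[n] - values[n - lag]) % modulus
        for n in range(lag, len(values))
    )

# Toy input: "0123", then "3456", then "6780"; each block is the previous one plus 3, mod 9.
print(lag_difference_histogram("012334566780", lag=4))
```

Run on the full 900 digits with the default lag of 64, it would show whether 3 really dominates and whether 7 and 8 are absent.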