July brought quite an impressive spam storm, which dumped 10,738 messages on me:
That’s a record for my inbox, well beyond the spike of 7,506 messages received last October. The mean number of messages over the two-year period shown is 3,867; the standard deviation is 2,190.
I’m intrigued by the amount of noise in this signal. The magnitude of the fluctuations suggests to me that somewhere in the spam economy there is a small-number bottleneck. Maybe there are only a few high-volume spammers in the world, so that when one of them goes on vacation, the overall volume sinks dramatically. Or maybe there are only a limited number of customers willing to pay for big spam mailings, so that the renewal or cancellation of a single contract can have a noticeable impact. Or it could be that there are only a few major lists of harvested email addresses; when the scraper misses one of my mailboxes, I see a big change.
A bottleneck of this kind is not the only possible explanation. In some other areas with high volatility–the stock market, for example–the apparent cause is not a small population of agents but strong correlations between agents, who all follow the same signs and signals. I suppose that might happen in the spam market, too, with broader economic trends affecting everyone in the same way, but somehow it seem less likely.
The graph below breaks down the monthly totals according to which of my various email addresses the spam targeted. Again the numbers seem to be bouncing around pretty wildly. For example, the July spike mostly came from my address here at bit-player, but the peak last October was dominated by an address at amsci.org.
Of the seven addresses I monitor, five are openly published on the web, and thus I shouldn’t be surprised that they attract their share of spam. But the other two addresses have never been published, and one of them I have never used or even handed out to friends. Those obscure addresses are getting about 1,000 spams a month in total.
One feature of my spam that doesn’t seem to fluctuate much from month to month is the proportion written in Russian or other languages that use the cyrillic alphabet. The fraction has hovered near one-half for the past year. I have a hard time imagining a model that produces such linguistic stability along with volatility in other dimensions.
There was a very well publicized event last November when McColo Corp. was shut down and spam volume dropped 75% before recovering. (It was around mid-Nov, which is why the month-to-month drop in your graph is less than 75%, and also why you see another dip in Dec even though traffic had started recovering by then.) There is a bottleneck not in terms of people but hosting companies doing the spamming. This is also the reason why the language ratios are stable — the spammers and their hosts are probably uncorrelated in terms of geographic location. Apparently this kind of event happens once every couple of years, but everyone migrates to a different host and all is well in the spam world again.
Another (improbable) factor might be that spammers (or their funding corp.) are aware of (the constantly updated) GMail (or Yahoo etc) spam filters and they have dumped older strategies (explaining the fall) or come up with newer (higher success probability) strategies (explaining the rise).
Again, very unlikely :) … but then you never know how much research is going on in another area!!
To compute the probability that the message is spam, taking into consideration all of its words or a relevant subset of them.I am satisfied by your way of research.
Read section 21.2.2.4 of Ross Anderson’s book, Security Engineering, 2nd Ed., Wiley, 2008, where he talks about the ‘lumpiness’ of spam statistics. He says that most spam comes from several dozen large gangs.
Anderson’s website is worth a visit :
http://www.cl.cam.ac.uk/~rja14/