Un-Spamming WordPress

Something old, and something new.

First the new thing, which like many new things is trivial.  This blog currently has 3,664 comments, and all but perhaps 100 are guaranteed to be spam.  Well, I had put up a “sticky” post some months ago, which means that it stays at the top of the blog no matter what.  Guess what?  Of the 3,664 comments, 3,231 are attached to the sticky post.  A-ha.  Most of these spam-bots simply attack the first post listed on the blog.  So new anti-spam measure #1: create a spam magnet post and make it sticky.  Tell people not to comment on this post.  Periodically, dump all of the comments.

The old thing is profound.  Obvious, but profound.  WordPress is not the friendliest environment in which to make bulk changes.  This can be an issue when you need to target large numbers of obviously spam comments while protecting those few worthy comments from, you know, human beings.  So, using a statistical trick and a bit of laziness, you can wipe out vast tracts of weeds while preserving the wheat.  Look at the comments in reverse chronological order, which is the default.  The first comment that you see is the most recently posted.  Let us assume that the time and date a comment was posted have nothing to do with whether it is spam or not.

Glance at the comment to ensure it is spam, and believe me, most of the time, a glance is all it will take.  Spam senders are diverse: some send a lot and some send just a little.  Some send in bursts, while others send slow-loris-like, trickling slowly but consistently to generate huge numbers regardless.  Still, the numbers will tell, and the law of large numbers means that this tip works particularly well *just when you need it*.

Find the first spam comment.  Note its sender address.  Search your comments (using the search function) for @baddomain.com, and when those results pop up, hit select all.  Then delete.  You’ll want to set the number of results per page to the maximum, 250 at last report.  You may only get seven, or three, or one on your first try.  But just by taking each spam comment in turn and then searching for similar senders, you will eliminate larger numbers sooner than smaller numbers.  Why?  Because at any point in the sequence of comments, you are more likely to run across a comment sent from a prolific sender than from a more (shall we say) discerning spam sender.  You could go about it in any order you like or none at all and statistically, the results would be the same.  No matter what you do, you will hit the worstest the firstest.  But the simplest thing to do, the one which takes the least work, is to simply accept the reverse chronological default listing and plod away at that.
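If you would rather see that than take my word for it, here is a minimal sketch in Python.  It is a toy, not a WordPress tool: the domain names and the counts are made up, and the shuffle stands in for the assumption that posting time has nothing to do with who sent the comment.  It walks the list from the top, "searches" on each new sender domain it meets, and records how many comments are gone after each search.

```python
import random

def removed_after_each_search(comments):
    """Walk the comment list top-down. Each time a new sender domain
    turns up, delete every comment from that domain. Return the running
    total of comments removed after each search."""
    counts = {}
    for domain in comments:
        counts[domain] = counts.get(domain, 0) + 1
    progress, total = [], 0
    # Dicts keep insertion order, so this loop visits domains in the
    # order they were first seen in the listing.
    for domain in counts:
        total += counts[domain]
        progress.append(total)
    return progress

# Hypothetical mix: one prolific sender, one middling, one "discerning".
comments = ["big.example"] * 80 + ["mid.example"] * 15 + ["tiny.example"] * 5
random.Random(0).shuffle(comments)  # posting time unrelated to sender
progress = removed_after_each_search(comments)
print(progress)
```

With this mix, the very first comment in the listing belongs to the prolific sender 80 times out of 100, so most of the pile tends to disappear on the first search without any ranking at all.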

Why is this such good news?  Because when you’re smart, you make things hard.  God knows I do, and I’m so smart that my shirt is inside out.  It’s easy to get wrapped up in a workflow of extracting counts of spam comments by sender, and then ranking by count, and then re-attacking once you have this highly valuable information.  You then proceed with an accurate list of the most prolific spam commenters, and you know that your list will eliminate the greatest amount of spam comments in the shortest amount of time, because you have them sorted by count, descending.
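For the record, the ranking step itself looks something like the sketch below.  It assumes you have already pried the sender addresses out of WordPress and into a plain Python list, which is the tedious part and not shown here; `domain_counts` and the sample addresses are hypothetical names of my own, not anything WordPress provides.

```python
from collections import Counter

def domain_counts(emails):
    """Count comments per sender domain, most prolific first."""
    domains = (
        email.rsplit("@", 1)[-1].lower()  # text after the last "@"
        for email in emails
        if "@" in email                   # skip malformed addresses
    )
    return Counter(domains).most_common()

# Made-up sample addresses, for illustration only.
sample = ["a@spam.example", "b@SPAM.example", "c@other.example"]
print(domain_counts(sample))
```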

But there’s a price to pay, and it is twofold.  First, you’ve spent some amount of time and effort to generate that list.  I say that the effort to do that is *not* repaid by the slight increase in performance you *might* see as opposed to just going down the list and searching on each bad guy you find, letting the Law of Large Numbers do the heavy lifting.  Second, once you expend a great deal of effort on proving your intellect to yourself, you are *less* likely to follow through with the monkey work of searching and deleting, and less likely to finish what you might start.

In order to generate some data, I went ahead and did the search thing for a while, until I ran out of lines on my pocket Moleskine.  Seemed like a fair enough “nothing up my sleeve” stopping point.  But I’m about to go whack everything on the sticky post and edit it to apologize for any worthwhile comments that I may have deleted.

Here are some numbers for the curious:

| (A) Rev. chronological position first found | (B) Domain | (C) Count in 131 | (D) Projected total = (C) × (3,664/131) | (E) Actual total |
|---|---|---|---|---|
| 1 | gmail.com | 63 | 1762.1 | 618 |
| 2 | mail.ru | 87 | 2433.3 | 1845 |
| 3 | popcas.ru | 1 | 18 | 1 |
| 13 | xmmail.ru | 8 | 223.76 | 60 |
| 18 | topazpro.xyz | 1 | 18 | 4 |
| 19 | yandex.ru | 5 | 139.85 | 37 |
| 21 | koreamail.com | 3 | 83.908 | 3 |
| 22 | spacecas.ru | 4 | 111.88 | 24 |
| 28 | urx7.com | 1 | 18 | 1 |
| 36 | bigmir.net | 5 | 139.85 | 14 |
| 51 | glmux.com | 1 | 18 | 6 |
| 63 | counsellor.com | 1 | 18 | 1 |
| 71 | priest.com | 2 | 55.939 | 2 |
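The projection in column (D) is plain scaling: take a domain's count in the sample and multiply by the ratio of all comments to sampled comments.  Here is a small sketch of that arithmetic; the function name and its defaults are my own, with 3,664 being the blog's total comment count and 131 the sample size.  The rows with counts above 1 match this scaling.

```python
def projected_total(count_in_sample, total_comments=3664, sample_size=131):
    """Scale a domain's count in the sample up to the whole comment pile."""
    return count_in_sample * total_comments / sample_size

# The two heaviest rows from the table above.
print(round(projected_total(63), 1))  # gmail.com
print(round(projected_total(87), 1))  # mail.ru
```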

And a picture of the correlation:
