I have always viewed blog spam as a complex problem that requires a multi-layered solution. There is simply no silver bullet that stops all the spam and gives you no false positives. Most conventional approaches can be grouped into six categories:
- Turing Tests
- Reverse Turing Tests (or robot detection)
- Spammer Annoyances
- Filters
- Whitelists and Blacklists
- Trackback Validation
Each category has its benefits and flaws, and none of them can guarantee that no spam will slip through. Below I will discuss each of these categories. I will link to various WordPress plugins, because that is what I use. Feel free to chime in with solutions for other blogging platforms in the comments.
The most commonly used, and perhaps the most controversial, method of combating blog spam is the good old Turing Test. It is usually a challenge which can be easily and effortlessly solved by humans, but is difficult for machines. The best example here is CAPTCHA. Usually CAPTCHAs are those little blurry images which contain words or numbers that you are supposed to type into a box to post. I'm using one on this very blog – so if you scroll down to the comment box you will see a perfect example. CAPTCHAs work because OCR technology is not perfect, and funky fonts and small distortions can easily fool most character recognition algorithms.
Unfortunately CAPTCHA has some downsides. It can be annoying to users, especially if it is illegible. It is also an accessibility issue. Blind people, or people with bad eyesight who rely on screen readers, can't solve CAPTCHAs. Thus, if you use one, you might be permanently denying a segment of your readership the ability to comment. Of course some CAPTCHAs use an audio challenge. I have even seen CAPTCHAs which ask users to solve little math or contextual problems by filling in a blank word or calculating the result of a simple equation. Still, the graphical CAPTCHA is the de facto standard because it is easy to implement, easy to solve, and most effective.
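The math-challenge variant mentioned above is simple enough to sketch in a few lines. This is just a toy illustration, assuming the challenge and its answer are stashed in the commenter's session between page load and form submission:

```python
# A minimal sketch of a text-based math CAPTCHA. Assumes the server
# stores the expected answer in the visitor's session between requests.
import random

def make_challenge():
    """Generate a simple addition problem and its expected answer."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"What is {a} + {b}?", a + b

def check_answer(submitted, expected):
    """Accept the comment only if the visitor solved the challenge."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False
```

Trivial for a human, but enough to stop a script that was never written with your particular blog in mind.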
Still, they are not perfect. There are ways around CAPTCHAs – you can use various session hijacking tricks, you can use cutting-edge OCR algorithms to break them, or you can simply harness the collective stupidity of MySpace users to solve them for you.
Reverse Turing Tests take the opposite approach. Instead of making the poster prove that they are human, they try to detect that the post is being made by a robot. There are many ways to do this. For example, the Bad Behavior WordPress plugin analyzes the user agent string and various request behaviors to identify non-human posters.
Personally I love Bad Behavior, but it appears that on some high-traffic sites the plugin can behave erratically. For example, Shamus, who runs the hilarious DM of the Rings comic, had major problems with human commenters being blocked for no reason. He eventually stopped using the plugin because of it. So false positives are a big issue here.
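To give you a feel for the kind of checks a robot detector performs, here is a toy version. Bad Behavior itself is far more thorough; the list of suspicious substrings and the specific heuristics below are made up for the example:

```python
# A toy robot detector in the spirit of Bad Behavior. The heuristics and
# the suspicious-agent list are illustrative assumptions, not the
# plugin's actual rules.
SUSPICIOUS_AGENTS = ("libwww-perl", "lwp::simple", "curl", "python-urllib")

def looks_like_robot(headers):
    """Flag requests whose headers resemble a script, not a browser."""
    agent = headers.get("User-Agent", "").lower()
    if not agent:                      # real browsers always send one
        return True
    if any(s in agent for s in SUSPICIOUS_AGENTS):
        return True
    # Browsers send an Accept header; many quick-and-dirty scripts don't.
    if "Accept" not in headers:
        return True
    return False
```

You can see where the false positives come from: a legitimate visitor behind an unusual proxy or browser can trip heuristics like these just as easily as a spam script.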
The next approach is something that I call Spammer Annoyance. The idea is simple – make the life of a spammer more difficult and he will move on to an easier target. CAPTCHA already fulfills that function – it is a barrier that prevents the bottom feeders from using their crappy scripts to post hundreds of comments per second. But since CAPTCHAs are annoying, people search for different solutions.
One good approach is to require a hash of the post's text to be submitted along with the post. The HashCash plugin is a good example of this method. Why does it work? Because to post, the spammer has to include the hashing algorithm in his script. Since hashing is resource intensive, the resulting script runs slower than normal, making the spamming less efficient. Most spammers don't care enough to bother with these things – they simply move on to easier targets.
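The asymmetry that makes this work is easy to demonstrate. Here is a sketch of the proof-of-work idea behind hashcash-style schemes – note this is the general technique, not necessarily how the WP HashCash plugin implements it, and the difficulty level and hash input are my own assumptions:

```python
# Sketch of a hashcash-style proof of work: the client must find a nonce
# whose hash carries a given prefix, which costs CPU time per comment.
# Difficulty and hash input format are illustrative assumptions.
import hashlib

def mint_token(comment_text, difficulty=4):
    """Brute-force a nonce whose hash starts with `difficulty` zeros.
    This is the expensive part the spammer's script has to pay for."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{comment_text}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify_token(comment_text, nonce, difficulty=4):
    """Verification is cheap for the server: a single hash."""
    digest = hashlib.sha256(f"{comment_text}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Minting costs thousands of hash attempts on average; verifying costs one. Multiply that minting cost by a hundred comments per second and the spammer's throughput collapses.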
Of course there are other, simpler ways to annoy spammers. For example, Travis proposes renaming the fields in your comment form to random strings (maybe randomly generated ones? what do you think, Travis?). Of course a well-constructed script would simply crawl the DOM and grab the appropriate fields, no matter what they are called. But most spamming schmucks out there use dirty hacks built with WWW::Mechanize and grab fields by name, relying on standard naming conventions.
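One way to implement Travis's idea without storing anything server-side is to derive the field names from a per-session value with an HMAC. This is my own sketch of the approach, not Travis's code – the secret key and field list are placeholders:

```python
# Sketch of randomized comment-form field names, derived per session via
# HMAC so the server can recognize them later without storing a mapping.
# SECRET and FIELDS are placeholder assumptions for the example.
import hashlib
import hmac

SECRET = b"change-me"                    # hypothetical per-site secret
FIELDS = ("author", "email", "comment")  # logical form fields

def field_name(session_id, field):
    """Derive an opaque, per-session name for a form field."""
    mac = hmac.new(SECRET, f"{session_id}:{field}".encode(), hashlib.sha256)
    return "f_" + mac.hexdigest()[:12]

def read_form(session_id, form):
    """Map the randomized names back to the logical fields on submit."""
    return {f: form.get(field_name(session_id, f), "") for f in FIELDS}
```

A script hardcoded to look for `author` and `comment` fields finds nothing, while the server can still reconstruct the submission from the session id alone.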
Filters are classic spam-fighting tools that have been applied, more or less successfully, to combat email spam. They use Bayesian filtering, pattern recognition and many other tricks to recognize spam and prevent it from being posted. Personally I like Akismet, which uses a centralized approach. Every user of the plugin submits their confirmed spam to a central database – so the algorithm is simultaneously trained on spam collected from hundreds if not thousands of different blogs.
If you don't want to rely on a centralized third-party database for your filtering, you can use something like Spam Karma 2, which is local and trains only on your own spam.
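The Bayesian idea at the core of these filters fits in a few lines. This toy version is nothing like production-grade Akismet or Spam Karma – real filters add proper tokenization, smoothing and many non-textual signals – but it shows how training on labeled comments produces a spam score:

```python
# A toy Bayesian comment filter: estimate per-word spam probabilities
# from labeled comments, then score new ones by summed log-odds.
# Real filters (Akismet, Spam Karma) are vastly more sophisticated.
import math
from collections import Counter

class TinyBayes:
    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, text, is_spam):
        """Learn word counts from one confirmed spam or ham comment."""
        (self.spam if is_spam else self.ham).update(text.lower().split())
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def score(self, text):
        """Log-odds that the comment is spam; positive means spammy.
        Laplace smoothing keeps unseen words from zeroing things out."""
        logodds = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for w in text.lower().split():
            p_spam = (self.spam[w] + 1) / (sum(self.spam.values()) + 2)
            p_ham = (self.ham[w] + 1) / (sum(self.ham.values()) + 2)
            logodds += math.log(p_spam / p_ham)
        return logodds
```

The centralized-versus-local distinction is just a question of whose comments feed `train()`: everyone's (Akismet) or only yours (Spam Karma 2).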
The obvious problem with filters is that they tend to produce false positives every once in a while. For example, almost every recent comment made by my friend Miloš has been eaten by Akismet – completely without rhyme or reason. He thinks it has something to do with my old CAPTCHA plugin, but I doubt it. I think it is just Akismet being weird.
Whatever plugin or algorithm you use, you will get false positives, so you need to keep all the flagged comments in some sort of moderation queue and monitor it. Unfortunately, if you get 200 spam posts per day, you are likely to dump them all without looking and simply hope there were no false positives among them. This is a big issue.
Because of these spam issues, some people require registration to post comments. This is a form of whitelisting – only a certain group of users is allowed to post. Personally I don't like this approach because registration – especially with email confirmation – is annoying. I would say it is about 100 times more annoying than the worst CAPTCHA imaginable, and I usually don't bother leaving comments in "members only" threads.
I do use blacklists in a limited way. Whenever I see large volumes of spam coming from the same IP, I simply block it in the .htaccess file. This prevents that IP address from ever accessing my website, so I use it sparingly and only for the most notorious spammers. Another good approach is to block all open proxies, as they are frequently used by spamming scripts. I don't do this, because I want to allow my users to read this blog and post comments anonymously, but not everyone cares about that.
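For reference, an IP block of this kind is a couple of lines in .htaccess. The address below is a placeholder from the documentation range, and this is the classic Apache 2.2 `Order`/`Deny` syntax (newer Apache versions use `Require` directives instead):

```apache
# Block one notorious spamming IP (placeholder address)
Order Allow,Deny
Allow from all
Deny from 192.0.2.13
```

Every request from that address now gets a 403 before it ever reaches WordPress, which is exactly why the hammer is so blunt – use it sparingly.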
Finally, there is Trackback Validation. Some spammers bypass CAPTCHAs, annoyances and most of the other methods described here by avoiding the comment form altogether – they simply generate fake pingbacks to create trackbacks on your blog. Since every pingback is associated with a linking URL, a simple way to avoid getting hit by mass-generated spam is to verify that this URL actually exists. I use the Simple Trackback Validation plugin to do this.
Of course this doesn't work if the spammer sends the pingbacks from an actual blog created on Blogger or WordPress or whatever. But, once again, it cuts down on the amount of crap posted to your blog by bottom-feeders with shitty Perl scripts and no knowledge or cunning.
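The validation itself amounts to an HTTP fetch plus a backlink check. Here is a sketch using only Python's standard library – the exact checks the Simple Trackback Validation plugin performs are assumptions on my part, but the idea is the same:

```python
# Sketch of trackback validation: the claimed source page must exist
# and must actually link back to your post. The specific checks are
# illustrative assumptions, not the plugin's actual implementation.
from urllib.error import URLError
from urllib.request import urlopen

def page_links_back(page_html, my_post_url):
    """The core test: does the source page contain a link to my post?"""
    return my_post_url in page_html

def trackback_is_valid(source_url, my_post_url):
    """Fetch the claimed source URL and verify the backlink."""
    try:
        with urlopen(source_url, timeout=10) as resp:
            if resp.status != 200:
                return False
            page = resp.read().decode("utf-8", errors="replace")
    except (URLError, ValueError):
        return False
    return page_links_back(page, my_post_url)
```

A mass-generated pingback pointing at a nonexistent page fails the fetch; one pointing at a real page that never mentions you fails the backlink test.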
Personally I use a layered approach:
- My first line of defense is a CAPTCHA (Peter’s Custom Anti-Spam)
- I employ Simple Trackback Validation as a CAPTCHA-like barrier for pingback spam
- I use Bad Behavior to test for robots
- Whatever slips through all the plugins above is usually scooped up by Akismet.
This multi-tiered approach works great for me. The CAPTCHA and validation scripts combined with Bad Behavior mean that my Akismet queue is almost always empty. But when something does slip through my front line defenses, I can always rely on the collective wisdom of half the internet to capture it.
What do you use? Any suggestions for alternative tools to use for Movable Type, Text Pad and whatever else is out there?
[tags]blogging, blog spam, akismet, spam karma, simple trackback validation, wordpress, captcha, hashcash[/tags]