Preventing spam in forums/blogs/comments/etc     

Spam. If you know a spammer, kick him in the balls for me. Then kick him in the balls for every spam you've ever received. No really, $100 if you send me video.

I wrote my own message board a while back, and I hoped that since it was custom and out of the way, the spammers wouldn't catch on to it. Boy, was I wrong. Within three days there were dozens of spam messages, all trying to boost their Google link ranking. It turns out that they just look for forms with textareas in them and post to them, hoping they'll get a link back. They don't really care what sites they're on.

Well, I'm smarter than some asshole spammer, so I figured out a pretty good way to block them.

  1. Don't let them post too fast. Even a two-second delay is enough to stop a basic automated poster.
  2. Make them load the posting page too, not just the target page.
  3. Make them run JavaScript. Yes, this also prevents some phones/lynx/etc from working, but in my case I've decided that it's worth it.

That all sounds easy, right? Here's the code. I use encryption because otherwise they could very easily reverse engineer the checks.

You'll need PHP to be compiled with mcrypt (http://php.net/mcrypt) and you'll need md5.js on your server.
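
One caveat on that requirement: mcrypt was deprecated in PHP 7.1 and removed from core in 7.2, so newer installs won't have it. The same "they can't fake the load time" property can be had by signing the timestamp instead of encrypting it. What follows is only a minimal sketch of that swapped-in technique, not the code this post describes: make_form_token(), read_form_token() and FORM_SECRET are hypothetical names, and it uses PHP's built-in hash_hmac() and hash_equals() in place of Twofish. The timestamp travels in cleartext here, but it can't be changed without knowing the secret.

```php
<?php
// Hypothetical shared secret (an assumption for this sketch); keep the real one out of the page source.
const FORM_SECRET = 'randomknownphrase_';

// parent.php side: build the hidden field value, "<timestamp>:<hex HMAC of the timestamp>".
function make_form_token(int $now, string $secret = FORM_SECRET): string
{
    return $now . ':' . hash_hmac('sha256', (string) $now, $secret);
}

// target.php side: return the load time if the token is authentic,
// or null if it is missing, malformed, or was tampered with.
function read_form_token(string $token, string $secret = FORM_SECRET): ?int
{
    $parts = explode(':', $token, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return null;
    }
    $expected = hash_hmac('sha256', $parts[0], $secret);
    // hash_equals() compares the two strings in constant time.
    return hash_equals($expected, $parts[1]) ? (int) $parts[0] : null;
}

$token = make_form_token(time());      // goes into a hidden <input>
$loaded_at = read_form_token($token);  // null means reject the post
```

The returned load time can then go through the same too-old / in-the-future / too-fast checks that target.php applies to the decrypted value.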


parent.php

print('<script type="text/javascript" src="md5.js"></script>');// Let the client calculate MD5 hashes
$encrypt_key = md5('randomknownphrase_'.date('z'));// This checks to make sure they loaded the parent page. Both pages need the same key, so we hash a known phrase together with a value that changes: date('z'), the day of the year. This should be pretty difficult to reverse engineer, since the key changes every 24h. If anyone loads the page at 23:59 and posts at 00:01 then it'll deny them though, so you could also use another known but changing value instead of z.
$form_num = rand(0, 5000);// Just to make it a bit tougher, the form name randomizes each time. It really doesn't make it harder on us, but it makes it marginally harder on an attacker.
print('<form action="target.php" method="post" name="form_'.$form_num.'">');// Set up the form, with our random id as the name
$iv = mcrypt_create_iv(mcrypt_get_iv_size(MCRYPT_TWOFISH, MCRYPT_MODE_CBC), MCRYPT_RAND);// We need the same IV for both encrypt and decrypt, so we create it here. The mode has to match the CBC mode used to encrypt below.
print('<input type="hidden" name="hash0" value="'.bin2hex($iv).'">');// We pass it over in cleartext, the php.net site says this isn't a security risk. bin2hex() makes it hex values instead of binary, so it can be passed by HTML
print('<input type="hidden" name="hash1" value="'.bin2hex(mcrypt_encrypt(MCRYPT_TWOFISH, $encrypt_key, time(), MCRYPT_MODE_CBC, $iv)).'">');// We encrypt our value. We only want to know the time() they loaded this page, but you could encrypt anything you want.
print('<input type="hidden" name="hash2" value="">');// This is the client-side javascript check, it gets filled in when they click "submit"
print('<input type="submit" onclick="form_'.$form_num.'.hash2.value=hex_md5(form_'.$form_num.'.hash1.value);">');// Fill in the md5 hash of something we know, hash1 in this case but it could be anything. hash1 is guaranteed to be different every load though.
print('</form>'); 

target.php

$encrypt_key = md5('randomknownphrase_'.date('z'));// Regenerate the same key as parent.php
$iv = $_POST['hash0'];
$hash1 = $_POST['hash1'];
$hash2 = $_POST['hash2'];
// The posted values
if ($iv == '') {
     print('Error: hash0 is required');
} else if ($hash1 == '') {
     print('Error: hash1 is required');
} else if ($hash2 == '') {
     print('Error: hash2 is required');
} else {
// Check to see that they were passed in, many spambots get blocked right here.
$post_time = rtrim(mcrypt_decrypt(MCRYPT_TWOFISH, $encrypt_key, pack('H*', $hash1), MCRYPT_MODE_CBC, pack('H*', $iv)), "\0");// un-bin2hex's hash1 and decrypts it. mcrypt pads the plaintext with NUL bytes, so we strip them before checking the value.
if ($post_time == '') {
     print('Error: You are not authorized to do that');
// If it didn't decode at all, throw an error. Note that I use the exact same generic error for all of these. The less info you give them on their error, the harder it is to reverse engineer your checks.
} else if (!is_numeric($post_time)) {
     print('Error: You are not authorized to do that');
// Didn't decode to a number
} else if (time() - $post_time > 60*60) {
     print('Error: Your posting session has expired, please go back and try again.');
// If they waited more than an hour between loading the form and posting
} else if (time() - $post_time < 0) {
     print('Error: You are not authorized to do that');
// If they tried to set a posting date in the future
} else if (time() - $post_time < 5) {
     print('Error: Whoa there cowboy, please post a bit slower. This feature is to block all those asshole spammers from polluting our message board.');
// If they waited less than 5 seconds before submitting. Most people will take more than 5 seconds to open, read, fill out the form, and click submit. You can tune this, but most bots take 0 seconds.
} else if (md5($hash1) != $hash2) {
     print('Error: You are not authorized to do that');
// If the client side javascript didn't calculate the md5 correctly
} else {
// If they make it here then you post it to your DB, etc.
}
}
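
The 23:59 / 00:01 problem mentioned in parent.php can be softened by having target.php try yesterday's key as well as today's. This is only a sketch under assumptions: candidate_keys() and decode_post_time() are hypothetical helper names, and $decrypt stands in for the mcrypt_decrypt() call above so the example doesn't depend on mcrypt being installed.

```php
<?php
// Keys that parent.php could have used for a form loaded today or yesterday.
// 'randomknownphrase_' is the same shared phrase used in parent.php.
function candidate_keys(int $now, string $phrase = 'randomknownphrase_'): array
{
    $yesterday = strtotime('yesterday', $now); // midnight at the start of yesterday
    return [
        md5($phrase . date('z', $now)),
        md5($phrase . date('z', $yesterday)),
    ];
}

// Try each key in turn. $decrypt is a stand-in (an assumption of this sketch)
// for the real decrypt call: it takes a key and returns the decoded string.
function decode_post_time(int $now, callable $decrypt): ?int
{
    foreach (candidate_keys($now) as $key) {
        $value = rtrim((string) $decrypt($key), "\0"); // drop NUL padding
        if (ctype_digit($value)) {
            return (int) $value; // decoded to a plain timestamp
        }
    }
    return null; // neither key produced a timestamp
}
```

Since the one-hour expiry check still runs afterwards, accepting yesterday's key only helps forms loaded shortly before midnight; it doesn't extend how long a form stays valid.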

Results

After running this setup for a little over a week I had:
Non-spam: 161
Spam: 7448
Spam rate: 4626% (spam posts per legitimate post)
Error by no_hash0: 15.9%
Error by no_hash2: 83.3%
Error by post_bad_hash2: 0.0%
Error by post time in future: 0.1%
Error by post time too fast: 0.7%
So you can clearly see that it works quite well. About one in six blocked posts (15.9%) arrived without hash0 at all. That's the kind of bot that requests the <form> page once and then stores what values to submit. Most of the rest (83.3%) load the page, but they don't run client-side JavaScript, so hash2 comes in empty. So far there have been zero false negatives and zero false positives, but I'm sure they'll happen, so the code will have to be tuned in the future.
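
For anyone checking the numbers: the 4626% figure is the ratio of spam to legitimate posts, not the share of all posts that were spam. A quick sketch of both calculations from the counts above (the variable names are just for illustration):

```php
<?php
$non_spam = 161;
$spam = 7448;

// Spam posts per legitimate post, as a percentage (the "spam rate" above).
$spam_per_legit = $spam / $non_spam * 100;
// Share of all posts that were spam.
$spam_share = $spam / ($spam + $non_spam) * 100;

printf("Spam per legitimate post: %.0f%%\n", $spam_per_legit); // 4626%
printf("Spam share of all posts: %.1f%%\n", $spam_share);      // 97.9%
```

Either way you slice it, nearly everything hitting the form was spam.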

The best part about this method is that it is transparent to the user. They don't have to decode some cryptic image or answer a riddle; they just behave like normal. Of course, the spammers can still fix their code as well, but it'll mean more CPU and more time on their end, which makes it less profitable.

This code is released to the public domain, so you can use it without restriction. I'd love to hear if you're using it or if you have suggestions though! kallahar@kallahar.com
