Critical Section

Archive: November 25, 2004


referral spam be gone

Thursday,  11/25/04  09:00 AM

Do you blog?  Have you noticed an increase in "referer spam"?  I have.  Seems like every day now I get about three sites which come along and "link" to every page on my blog.  It's become annoying, because I really enjoy looking through my referral logs; it's one of the best ways to find cool new blogs.  Anyway, this morning I decided to do something about it: I wrote a little spam filter for my server logs.

What's all this about?  Well, every time someone requests a page from my server, it causes the server to write a log entry into a file.  As part of the request, there can be a "referer", which is the address of the page from which the request was made.  If the request was a result of a link, the referring page will be the page which contained the link.  If site X has a link to me, and someone clicks on that link, the log will have the URL of the page on site X.
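To make that concrete, here is a minimal sketch of pulling the referer out of a single log entry.  It assumes Apache's "combined" log format (the post doesn't say which format is in use), and the sample line, addresses, and the get_referer name are made up for illustration.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<prefix>\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ )'
    r'"(?P<referer>[^"]*)"'
    r'(?P<suffix> "[^"]*")\s*$'
)


def get_referer(line):
    """Return the referer field of a combined-format log line, or None."""
    m = COMBINED.match(line)
    if m is None:
        return None
    referer = m.group("referer")
    # Apache writes "-" when the request carried no referer.
    return None if referer in ("", "-") else referer


# Hypothetical entry: someone on site X clicked a link to /index.html.
sample = ('203.0.113.7 - - [25/Nov/2004:09:00:00 -0800] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"http://site-x.example/links.html" "Mozilla/5.0"')

print(get_referer(sample))
```

The simple pattern does not handle escaped quotes inside a field, which is good enough for a sketch but not for every real log.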

Unfortunately there are dirtbags out there who exploit this as a means of publicity; they make bogus requests to my site giving the URL of their site as the referer.  Of course their site doesn't actually link to me; they just want me to go check out their site.  It is a very lame way of publicizing a site because 1) the only audience for the referers is a site's webmaster and 2) if the webmaster does visit the referred-to site, they'll already have a really low opinion of the site operators.

So, what to do?  Well, I'm already piping logs through a filter, the great little [free] program called cronolog.  I just added another filter to get rid of referer spam.  Here's my new ErrorLog entry (in Apache's httpd.conf):

ErrorLog "| /var/log/httpd/ | /usr/local/bin/cronolog /var/log/httpd/"

This is all one line.  The webserver passes each log entry into the reffilter.ksh script (my new invention), which then passes each entry on to cronolog (which writes the entries into files named for the current year and month).  The reffilter.ksh script processes every log entry as follows:

  • If the entry doesn't have a "referer", pass it through.
  • If the entry's referer is my site, pass it through.  This happens a lot; links within the site, and links from pages on my site to image files.
  • If the entry's referer is not a well-formed URL, pass it through.  A lot of search engine robots and RSS feed readers give a bogus URL.  These are actually nice to have, so I leave them.
  • If the referer is a well-formed URL which is not my site, I retrieve the page from the URL.  If this fails, I pass the referer through.  I don't mind having referers which I can't access (because they're password protected, or from an email system, or whatever).  No referral spammer would give a bad URL.
  • If I was able to retrieve the page, I scan it to find the reference.  If there's a link to my site, great, it was a legitimate referral, and I pass it through.
  • If there's no reference to my site - aha, I caught you.  I piss on you from a great height, and silently remove the referral from the log entry before passing it through.
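The rules above can be sketched roughly as follows.  This is not the reffilter.ksh script itself, just an illustration in Python under some assumptions: the log uses the "combined" format, "removing" a referral means replacing it with "-", and "a link to my site" is approximated by looking for the hostname anywhere in the fetched page.  MY_HOST, the sample addresses, and all function names are hypothetical.

```python
import re
from urllib.parse import urlsplit
from urllib.request import urlopen

MY_HOST = "www.example.com"  # hypothetical stand-in for the blog's hostname

# Assumes the "combined" log format, where the last two quoted fields
# are the referer and the user agent.
LINE = re.compile(r'^(?P<head>.* )"(?P<referer>[^"]*)"(?P<tail> "[^"]*")\s*$')


def fetch_page(url, timeout=10):
    """Return the page text, or None if the page cannot be retrieved."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:  # any failure just means "could not retrieve"
        return None


def is_spam(referer, fetch=fetch_page, my_host=MY_HOST):
    """True only when the referring page was fetched and never mentions my_host."""
    if referer in ("", "-"):
        return False                    # no referer
    try:
        parts = urlsplit(referer)
        host = parts.hostname
    except ValueError:
        return False                    # not a well-formed URL
    if parts.scheme not in ("http", "https") or not host:
        return False                    # not a well-formed URL
    if host == my_host:
        return False                    # link from within my own site
    page = fetch(referer)
    if page is None:
        return False                    # unreachable: give it the benefit of the doubt
    return my_host not in page.lower()  # fetched, but no reference to my site


def filter_line(line, fetch=fetch_page):
    """Return the log line, with the referer blanked out if it is spam."""
    m = LINE.match(line)
    if m and is_spam(m.group("referer"), fetch):
        return m.group("head") + '"-"' + m.group("tail") + "\n"
    return line


# Offline demo: a dictionary of canned pages stands in for the network.
PAGES = {
    "http://friend.example/links.html": '<a href="http://www.example.com/">a blog</a>',
    "http://spammer.example/": "<h1>cheap pills</h1>",
}


def log_line(referer):
    """Build a made-up combined-format entry carrying the given referer."""
    return ('203.0.113.7 - - [25/Nov/2004:09:00:00 -0800] "GET / HTTP/1.1" '
            '200 5120 "' + referer + '" "Mozilla/5.0"\n')


for ref in ("-", "http://friend.example/links.html", "http://spammer.example/"):
    print(filter_line(log_line(ref), fetch=PAGES.get), end="")
```

To use something like this as a piped log filter, one would read lines from standard input, write filter_line(line) to standard output, and flush after each line so the next program in the pipe sees entries promptly.  Fetching a page per log entry is slow, so a real filter would likely want to remember verdicts for referers it has already checked.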

So far this morning I have filtered 13 spams.  Very satisfying.  Yes, it is an odd way to spend Thanksgiving morning.  But then, I am odd, so there you are.

BTW, yes, "referer" is misspelled.  Someone at NCSA spelled it wrong at time zero, and now we're all stuck with it.  You've got to love that.

P.S. If you would like my little script for your own use, please shoot me an email; I'm happy to share.

