Re: [bitfolk] Preventing spidering of web applications

Author: Jeremy Kitchen
Date:  
To: users
Subject: Re: [bitfolk] Preventing spidering of web applications

On Thu, Jan 03, 2013 at 06:24:54AM +0000, Andy Smith wrote:
> Hello,
>
> What's the list's preferred techniques for preventing spidering of a
> web application (in this case Mediawiki) by misguided web robots?
>
> robots.txt already in place, but they ignore that of course.
>
> Ideally Apache-based.
>
> I don't particularly care if they are still able to download the
> content or not, I just don't want them taking up every single
> process slot thus impacting non-abusive 'real' web clients. So a
> rate-limiting solution would be acceptable.


http://dominia.org/djao/limitipconn2.html

I implemented this at DreamHost 5 or 6 years ago to great effect, and as
far as I know it's still in place. It was put in for a different reason,
but it worked quite well.

There does seem to be a race condition where a high rate of incoming
connections from a given IP can cause connections to be dropped slightly
below the configured threshold, but I just set our limit to 20; it
stopped the baddies, and I think we got maybe 3 complaints about it
ever.
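
For reference, a rough sketch of the sort of config this involves (the
module path, the Location, and the image exemption are just illustrative;
20 is the limit I mentioned above, and depending on the version you may
also need mod_status with ExtendedStatus On):

    LoadModule limitipconn_module modules/mod_limitipconn.so

    <IfModule mod_limitipconn.c>
        <Location />
            # cap simultaneous connections from any single client IP
            MaxConnPerIP 20
            # don't count image requests against the limit, so pages
            # with lots of inline images don't trip it
            NoIPLimit image/*
        </Location>
    </IfModule>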

-Jeremy