
How to detect traffic from the most common search spiders with Ruby

Ruby posted about 1 month ago by christian
This snippet detects traffic from the following bots, which is enough for me:
  • Google – Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • Google Image – Googlebot-Image/1.0 (+http://www.googlebot.com/bot.html)
  • MSN Live – msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)
  • Yahoo – Mozilla/5.0 (compatible; Yahoo! Slurp;)

The code:

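   # Lowercase the User-Agent header; to_s guards against a missing header (nil)
   user_agent = request.user_agent.to_s.downcase
   # @bot is the first substring that matches, or nil if none does
   @bot = ['msnbot', 'yahoo! slurp', 'googlebot'].detect { |bot| user_agent.include?(bot) }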

When the Google bot visits your site, @bot will contain 'googlebot'. For a visitor that matches none of the listed bots, @bot will be nil.

If you need to detect more bots than these, then the user-agents.org site contains a list of various user agents for both bots and browsers.
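
If the list grows, it may be easier to keep it in one place. The following is only a sketch of that idea: the `KNOWN_BOTS` constant and the `detect_bot` method are hypothetical names that are not part of the original snippet, and the method takes the user agent string as an argument so it can be tried outside of a controller.

```ruby
# Hypothetical helper, not part of the original snippet.
# Keys are lowercase substrings to look for in the User-Agent header;
# values are display names taken from the list of bots above.
KNOWN_BOTS = {
  'googlebot'    => 'Google',
  'msnbot'       => 'MSN Live',
  'yahoo! slurp' => 'Yahoo'
}.freeze

# Returns the display name of the first matching bot, or nil if none matches.
def detect_bot(user_agent)
  ua = user_agent.to_s.downcase # to_s turns a missing header (nil) into ""
  match = KNOWN_BOTS.find { |needle, _name| ua.include?(needle) }
  match && match.last
end

p detect_bot('Googlebot/2.1 (+http://www.googlebot.com/bot.html)')
p detect_bot('Mozilla/5.0 (X11; Linux x86_64)')
```

Because the match is a plain substring test, 'googlebot' also matches the Googlebot-Image user agent. Adding another bot is then a matter of adding one more entry to the hash.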

Tagged spider, web crawler, bot, search, user agent, detect

Detecting file/data encoding with Ruby and the chardet RubyGem

Ruby posted 5 months ago by christian

You can use the chardet gem to detect the charset of an arbitrary string.

Install the chardet gem by issuing the following command:

   $ sudo gem install chardet

Then in irb:

   require 'rubygems'
   require 'UniversalDetector'
   p UniversalDetector::chardet('Ascii text')
   p UniversalDetector::chardet('åäö')

The output from this example is:

   {"encoding"=>"ascii", "confidence"=>1.0}
   {"encoding"=>"utf-8", "confidence"=>0.87625}
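
Since the result is a guess that comes with a confidence value, you may want to ignore low-confidence answers. The following is a rough sketch that works on a hash shaped like the output above. The `confident_encoding` method and its default threshold of 0.5 are assumptions made for illustration; they are not part of the chardet gem.

```ruby
# Hypothetical helper, not part of the chardet gem.
# Takes a result hash with "encoding" and "confidence" keys, as in the
# output above, and returns the encoding name only when the confidence
# reaches the threshold (0.5 is an arbitrary default).
def confident_encoding(result, threshold = 0.5)
  return nil if result.nil? || result['encoding'].nil?
  result['confidence'].to_f >= threshold ? result['encoding'] : nil
end

p confident_encoding({ 'encoding' => 'utf-8', 'confidence' => 0.87625 })
p confident_encoding({ 'encoding' => 'utf-8', 'confidence' => 0.2 })
```

With the gem installed, the hash returned by UniversalDetector::chardet could be passed in directly.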

For Python users there exists an identical library…

Tagged detect, charset, encoding, ruby, chardet