
How to use Ruby and SimpleRSS to parse RSS and Atom feeds

Ruby posted 8 months ago by christian

This script is an example of how to use the SimpleRSS gem to parse an RSS feed.

The script can easily be modified to support conditional GETs. It also detects the feed’s character encoding and converts the feed to UTF-8.

    require 'iconv'       # part of the standard library in Ruby 1.8; removed in Ruby 2.0
    require 'net/http'
    require 'net/https'
    require 'rubygems'
    require 'simple-rss'

    url = URI.parse('http://hbl.fi/rss.xml')

    http = Net::HTTP.new(url.host, url.port)

    http.open_timeout = http.read_timeout = 10  # Set open and read timeout to 10 seconds
    http.use_ssl = (url.scheme == "https")

    # The two conditional GET values are placeholders: replace them with the
    # Last-Modified and ETag values saved from the previous response.
    headers = {
      'User-Agent'          => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
      'If-Modified-Since'   => 'store in a database and set on each request',
      'If-None-Match'       => 'store in a database and set on each request'
    }

    # request_uri includes the query string, path does not
    response = http.get(url.request_uri, headers)
    body = response.body

    # Look for the encoding declared in the XML prolog
    encoding = body.scan(
      /^<\?xml [^>]*encoding="([^\"]*)"[^>]*\?>/
    ).flatten.first

    # encoding is nil when the feed has no XML declaration
    if encoding.nil? || encoding.empty?
      if response["Content-Type"] =~ /charset=([\w\d-]+)/
        encoding = $1.downcase
        puts "Feed #{url} is #{encoding} according to Content-Type header"
      else
        puts "Unable to detect content encoding for #{url}, using default."
        encoding = "ISO-8859-1"
      end
    else
      puts "Feed #{url} is #{encoding} according to XML"
    end

    # Use 'UTF-8//IGNORE', if this throws an exception
    ic = Iconv.new('UTF-8', encoding)
    body = ic.iconv(body)

    feed = SimpleRSS.parse(body)

    feed.items.each do |item|
      puts item.title
    end
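
Iconv is no longer part of the standard library from Ruby 2.0 onwards, so on newer Rubies the conversion step needs a different tool. The following is only a sketch of the same detection order (XML declaration, then Content-Type, then ISO-8859-1) built on String#encode. The helper name feed_to_utf8 is made up for this example and is not part of SimpleRSS or the script above.

```ruby
# Returns a UTF-8 copy of an XML feed body (requires Ruby 1.9 or newer).
# content_type is the optional Content-Type response header.
# feed_to_utf8 is a hypothetical helper name, not a library method.
def feed_to_utf8(body, content_type = nil)
  # Work on raw bytes so the regular expressions never see invalid characters
  raw = body.dup.force_encoding(Encoding::BINARY)

  # Same detection order as the script above: XML declaration, then header
  declared = raw[/\A\s*<\?xml[^>]*encoding=["']([^"']+)["']/i, 1]
  declared ||= content_type.to_s[/charset=([\w-]+)/i, 1]
  declared ||= 'ISO-8859-1'

  # Reinterpret the bytes in the declared encoding, then transcode to UTF-8
  raw.force_encoding(declared).encode('UTF-8', invalid: :replace, undef: :replace)
end
```

With this in place, `SimpleRSS.parse(feed_to_utf8(response.body, response['Content-Type']))` would stand in for the Iconv lines. Note that force_encoding raises ArgumentError if the feed declares an encoding name Ruby does not know.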

Tagged rss, atom, parse, ruby, simplerss, encoding, utf-8

Example of how to fetch a URL with Net::HTTP and Ruby

Ruby posted 8 months ago by christian

    require 'net/http'
    require 'net/https'

    url = URI.parse('http://www.google.com/yo?query=yahoo')

    http = Net::HTTP.new(url.host, url.port)

    http.open_timeout = http.read_timeout = 10  # Set open and read timeout to 10 seconds
    http.use_ssl = (url.scheme == "https")

    # The conditional GET headers are left empty in this example
    headers = {
      'User-Agent'          => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
      'If-Modified-Since'   => '',
      'If-None-Match'       => ''
    }

    # Note to self, use request_uri not path: http://www.ruby-doc.org/core/classes/URI/HTTP.html#M004934
    response = http.get(url.request_uri, headers)
    body = response.body

    puts response.code
    puts response.message

    # Print every response header
    response.each { |key, val| puts "#{key} = #{val}" }
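
The empty If-Modified-Since and If-None-Match headers above are where a conditional GET would plug in. Here is a rough sketch of one way to fill them, assuming the validators from the previous response are kept in a plain Hash (standing in for a database row). The helper names and the User-Agent string are invented for this example.

```ruby
require 'net/http'
require 'uri'

# Build request headers from validators saved on a previous fetch.
# cache is a hypothetical Hash standing in for a database row.
def conditional_headers(cache)
  headers = { 'User-Agent' => 'feed-fetcher-example' }
  headers['If-None-Match']     = cache[:etag]          if cache[:etag]
  headers['If-Modified-Since'] = cache[:last_modified] if cache[:last_modified]
  headers
end

# Remember the validators the server sent so the next request can reuse them.
def remember_validators(cache, response)
  cache[:etag]          = response['ETag']          if response['ETag']
  cache[:last_modified] = response['Last-Modified'] if response['Last-Modified']
  cache
end

# Fetch the URL; returns the body, or nil when the server answers 304 Not Modified.
def fetch_if_changed(url, cache)
  uri  = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = http.read_timeout = 10

  response = http.get(uri.request_uri, conditional_headers(cache))
  return nil if response.code == '304'

  remember_validators(cache, response)
  response.body
end
```

The headers are only sent once a value has been stored, so the first request goes out as a normal GET.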

Tagged net, http, ruby, example, headers

Recursively add files to ClearCase

Ruby posted 8 months ago by christian

This script adds all files in the current directory to ClearCase.

Save the following script as add_recursively.rb in the directory you want to add to ClearCase:

    # List view-only objects recursively (-r), short names only (-s)
    %x{cleartool ls -view_only -r -s . > view_private_files.txt}

    # Quote each path so that file names containing spaces survive the shell
    lines = File.readlines('view_private_files.txt').collect { |line| %Q{"#{line.chomp}"} }

    # Work around command line length limit in Windows
    while lines.size > 0
      %x{cleardlg /addtosrc #{lines.slice!(0..100).join(' ')}}
    end

Next, open a command line window and execute the script:

    cd clearcase_vob
    ruby add_recursively.rb
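
The batching used to stay under the command line length limit can also be written with each_slice. This is just a sketch of that one step. The helper name batched_arguments and the default batch size are made up for the example, and it builds the argument strings without calling cleardlg.

```ruby
# Quote each path and group the paths into argument strings of at most
# batch_size entries. batched_arguments is a hypothetical helper name.
def batched_arguments(paths, batch_size = 100)
  paths.map { |path| %Q{"#{path.chomp}"} }
       .each_slice(batch_size)
       .map { |batch| batch.join(' ') }
end
```

Each returned string could then be passed to one cleardlg /addtosrc call.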

ClearCase sucks, use Mercurial or git instead…

Tagged add, recursive, clearcase, ruby, script

Using backgroundrb to execute tasks asynchronously in Rails

Ruby posted 8 months ago by christian

Draft…

Planning on using BackgrounDRb? Take a long look at the alternatives first

Ask yourself, do you really need a complex solution like BackgrounDRb? Most likely you don’t, so use a simple daemonized process instead; see this snippet about the daemons gem for more information.

Heck, even a simple Ruby script run by cron every 5 minutes will be more stable than BackgrounDRb and require less work.
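
A cron-driven script along those lines might look roughly like this. It is a sketch under assumptions, not something from a real project: the lock file path and the with_exclusive_lock helper are invented for the example, and the work inside the block is a placeholder. The lock keeps a slow run from overlapping with the next one that cron starts.

```ruby
require 'tmpdir'

# Hypothetical lock file location; any path writable by the cron user works.
LOCK_PATH = File.join(Dir.tmpdir, 'update_feeds.lock')

# Runs the block only if no other process holds the lock.
# Returns true if the block ran, false if another run is still active.
def with_exclusive_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0644) do |file|
    # LOCK_NB makes flock return false immediately instead of waiting
    return false unless file.flock(File::LOCK_EX | File::LOCK_NB)
    yield
    true
  end
end

if __FILE__ == $PROGRAM_NAME
  ran = with_exclusive_lock(LOCK_PATH) do
    # Placeholder for the real work, e.g. refreshing feeds
    puts "Updating feeds at #{Time.now}"
  end
  puts 'Previous run still active, skipping.' unless ran
end
```

A crontab entry such as `*/5 * * * * ruby /path/to/update_feeds.rb` (the path is a placeholder) would then run it every 5 minutes.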

Even if you really need to process a lot of data asynchronously in the background, I wouldn’t recommend BackgrounDRb. It’s riddled with bugs and unstable in production, so use the BJ plugin instead.

Anyway, continue reading if you want to use BackgrounDRb…

Installing the prerequisites:

    $ sudo gem install chronic packet

Installing backgroundrb

    $ cd rails_project
    $ git clone git://gitorious.org/backgroundrb/mainline.git vendor/plugins/backgroundrb

You can also get the latest stable version from the Subversion repository:

    svn co http://svn.devjavu.com/backgroundrb/trunk vendor/plugins/backgroundrb

Set up backgroundrb

    rake backgroundrb:setup

Create a worker

    ./script/generate worker feeds_worker

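    require 'benchmark'

    class FeedsWorker < BackgrounDRb::MetaWorker
      set_worker_name :feeds_worker

      def create(args = nil)
        # this method is called, when worker is loaded for the first time
        logger.info "Created feeds worker"
      end

      # data defaults to nil so that update can be called without arguments
      def update(data = nil)
        logger.info "Updating #{Feed.count} feeds."

        thread_pool.defer do
          # Measure inside the deferred block; timing the defer call itself
          # would only measure how long it takes to queue the work
          seconds = Benchmark.realtime do
            # Feed.update_all is assumed to be defined by the application
            Feed.update_all()
          end

          logger.info "Update took #{'%.5f' % seconds} seconds."
        end
      end
    end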

Starting backgroundrb

First configure backgroundrb by opening config/backgroundrb.yml in your editor:

    :backgroundrb:
      :ip: 0.0.0.0

    :development:
      :backgroundrb:
        :port: 11111     # use port 11111
        :log: foreground # foreground mode, print log messages on console

    :production:
      :backgroundrb:
        :port: 22222      # use port 22222

Next, start backgroundrb in development mode:

    ./script/backgroundrb -e development &

Call your worker

From the command line:

    $ script/console
    Loading development environment (Rails 2.0.2)
    >> MiddleMan.worker(:feeds_worker).update()

When things go wrong

Asynchronous programming is complex, so expect bugs…

Rule #1: know who you’re calling.

If you give your MiddleMan the wrong name of your worker, he’ll just spit this crap at you:

    You have a nil object when you didn't expect it!
    The error occurred while evaluating nil.send_request
    /usr/local/lib/ruby/gems/1.8/gems/packet-0.1.5/lib/packet/packet_master.rb:44:in `ask_worker'
    /Users/christian/Documents/Projects/xxx/vendor/plugins/backgroundrb/server/lib/master_worker.rb:104:in `process_work'
    /Users/christian/Documents/Projects/xxx/vendor/plugins/backgroundrb/server/lib/master_worker.rb:35:in `receive_data'
    /usr/local/lib/ruby/gems/1.8/gems/packet-0.1.5/lib/packet/packet_parser.rb:29:in `call'
    /usr/local/lib/ruby/gems/1.8/gems/packet-0.1.5/lib/packet/packet_parser.rb:29:in `extract'
    /Users/christian/Documents/Projects/xxx/vendor/plugins/backgroundrb/server/lib/master_worker.rb:31:in `receive_data'

For example, this command generates the error above:

    MiddleMan.worker(:illegal_worker).update()

It’s always nice to see a cryptic error message such as this; it really deserves an award.

Check for bugs and bug fixes

git mainline commits

Going to production

Starting the daemon:

    ./script/backgroundrb -e production start

Configuring your task to run periodically

The following example makes backgroundrb call the FeedsWorker’s update method once every 15 minutes:

    :production:
      :backgroundrb:
        :port: 22222      # use port 22222
        :lazy_load: true  # do not load models eagerly
        :debug_log: false # disable log workers and other logging
    # Cron based scheduling
    :schedules:
      :feeds_worker:
        :update:
          # Fields: sec min hour day month weekday. Quoted, because YAML reads a bare * as an alias.
          :trigger_args: "0 */15 * * * *"
          :data: "Hello world"

At the time of writing, the cron scheduler seems to be broken, so I prefer hard-coding the interval in the worker’s create method:

    def create(args = nil)
      # Call update every 15 minutes for as long as the worker is running
      add_periodic_timer(15.minutes) { update(nil) }
    end

If using Vlad or Capistrano, it’s also a good idea to fix script/backgroundrb by changing these lines:

    pid_file = "#{RAILS_HOME}/../../shared/pids/backgroundrb_#{CONFIG_FILE[:backgroundrb][:port]}.pid"
    SERVER_LOGGER = "#{RAILS_HOME}/../../shared/log/backgroundrb_server_#{CONFIG_FILE[:backgroundrb][:port]}.log"

Resources

Backgroundrb homepage

Backgroundrb best practices

Backgroundrb scheduling

Debugging backgroundrb

Backgroundrb’s README

topfunky’s messaging article

Tagged backgroundrb, rails, ruby, distributed, messaging

Detecting file/data encoding with Ruby and the chardet RubyGem

Ruby posted 8 months ago by christian

You can use the chardet gem to detect the charset of an arbitrary string.

Install the chardet gem by issuing the following command:

    $ sudo gem install chardet

Then in irb:

    require 'rubygems'
    require 'UniversalDetector'
    p UniversalDetector::chardet('Ascii text')
    p UniversalDetector::chardet('åäö')

The output from this example is:

    {"encoding"=>"ascii", "confidence"=>1.0}
    {"encoding"=>"utf-8", "confidence"=>0.87625}

For Python users there exists an identical library…
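
A detection result like the ones shown above can then drive a conversion to UTF-8. The sketch below is one possible way to do that on Ruby 1.9 or newer, and it does not call the chardet gem itself: it only takes a hash of the same shape. The helper name to_utf8 and the 0.5 confidence threshold are arbitrary choices made for this example.

```ruby
# Converts data to UTF-8 using a detection hash with "encoding" and
# "confidence" keys. Returns nil when the guess is missing, is a name Ruby
# has no encoding for, or falls below min_confidence.
# to_utf8 and the 0.5 default are hypothetical choices for this sketch.
def to_utf8(data, detection, min_confidence = 0.5)
  name = detection['encoding']
  return nil if name.nil? || detection['confidence'].to_f < min_confidence

  begin
    encoding = Encoding.find(name)
  rescue ArgumentError
    return nil # the detector reported a name Ruby does not recognise
  end

  # Reinterpret the bytes in the detected encoding, then transcode
  data.dup.force_encoding(encoding)
      .encode('UTF-8', invalid: :replace, undef: :replace)
end
```

Returning nil for a low-confidence guess leaves it to the caller to decide on a fallback, such as the ISO-8859-1 default used in the feed parsing script.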

Tagged detect, charset, encoding, ruby, chardet