
Parsing feeds with Ruby and the FeedTools gem

Ruby posted 8 months ago by christian

This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and other feed formats.

The only negative thing about FeedTools is that the project is abandoned. The author said so in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”

Installing

    $ sudo gem install feedtools

Fetching and parsing a feed

Easy…

    require 'rubygems'
    require 'feed_tools'

    # Fetch and parse the feed in one call
    feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

    # Feed-level metadata
    puts feed.title
    puts feed.link
    puts feed.description

    # Each entry in the feed
    feed.items.each do |item|
      puts item.title
      puts item.link
      puts item.content
    end

Feed autodiscovery

Give FeedTools the URL of the site instead of the feed, and it finds the Slashdot feed for you:

    puts FeedTools::Feed.open('http://www.slashdot.org').href
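To give a rough idea of what autodiscovery involves, here is a minimal sketch that scans a page for `<link rel="alternate">` tags pointing at RSS or Atom feeds. This is not FeedTools' own implementation; the `discover_feeds` helper, the `FEED_TYPES` constant, and the example.org URLs are made up for illustration, and a regular expression is only a crude stand-in for a real HTML parser.

```ruby
require 'uri'

# MIME types that conventionally mark a feed link (assumption: RSS and Atom only)
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Hypothetical helper: returns absolute feed URLs advertised in an HTML page.
def discover_feeds(html, base_url)
  html.scan(/<link\b[^>]*>/i).map do |tag|
    # Collect the quoted attributes of this <link> tag into a hash
    pairs = tag.scan(/([\w-]+)\s*=\s*["']([^"']*)["']/)
    attrs = Hash[pairs.map { |name, value| [name.downcase, value] }]

    next unless attrs['rel'].to_s.downcase.split.include?('alternate')
    next unless FEED_TYPES.include?(attrs['type'].to_s.downcase)
    next if attrs['href'].to_s.empty?

    # Resolve relative hrefs against the page URL
    URI.join(base_url, attrs['href']).to_s
  end.compact
end

page = <<-HTML
<html><head>
  <link rel="stylesheet" href="/style.css">
  <link rel="alternate" type="application/rss+xml" title="RSS" href="/index.rss">
</head></html>
HTML

puts discover_feeds(page, 'http://www.example.org/')
```

Stylesheet links and other non-feed `<link>` tags are skipped because they lack the `alternate` relation or a feed MIME type.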

Helpers

FeedTools can also clean up your dirty XML/HTML:

    require 'feed_tools'
    require 'feed_tools/helpers/feed_tools_helper'

    # Example input; any messy markup string works here
    html = '<p>Unclosed <b>markup'
    FeedTools::HtmlHelper.tidy_html(html)

Database cache

FeedTools can also store the fetched feeds for you:

    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"

The schema contains all you need:

    -- Example MySQL schema
    CREATE TABLE `cached_feeds` (
      `id`              int(10) unsigned NOT NULL auto_increment,
      `href`            varchar(255) default NULL,
      `title`           varchar(255) default NULL,
      `link`            varchar(255) default NULL,
      `feed_data`       longtext default NULL,
      `feed_data_type`  varchar(20) default NULL,
      `http_headers`    text default NULL,
      `last_retrieved`  datetime default NULL,
      `time_to_live`    int(10) unsigned NULL,
      `serialized`      longtext default NULL,
      PRIMARY KEY  (`id`)
    );

There’s even a Rails migration file included.
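The `last_retrieved` and `time_to_live` columns are what make expiry possible. As a sketch of how a cache could decide whether to fetch a feed again, here is a small stand-alone check. It does not reproduce FeedTools' own logic: the `CachedFeed` struct and the `stale?` helper are hypothetical, and it assumes `time_to_live` is stored in seconds.

```ruby
# Hypothetical in-memory stand-in for one row of the cached_feeds table
CachedFeed = Struct.new(:href, :last_retrieved, :time_to_live)

# Returns true when the cached copy should be fetched again.
# Assumption: time_to_live is a number of seconds.
def stale?(cached, now = Time.now)
  # A row that was never retrieved, or has no TTL, is always refetched
  return true if cached.last_retrieved.nil? || cached.time_to_live.nil?

  now - cached.last_retrieved > cached.time_to_live
end

now = Time.now
fresh   = CachedFeed.new('http://www.example.org/index.rss', now - 60, 900)
expired = CachedFeed.new('http://www.example.org/index.rss', now - 3600, 900)

puts stale?(fresh, now)
puts stale?(expired, now)
```

The first row was retrieved a minute ago with a 15-minute TTL, so it is still usable; the second is an hour old and would be refetched.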

Feed updater

There’s also a feed updater tool that can fetch feeds in the background, but I haven’t had time to look at it yet.

    $ sudo gem install feedupdater

Character set/encoding bug

As always, there are bugs you need to be aware of, and FeedTools is no different. There’s an encoding bug: FeedTools encodes everything to ISO-8859-1 instead of UTF-8, which should be the default encoding.

To fix it, use the following code:

    require 'iconv'
    # Iconv.new takes (to, from), so this converts UTF-8 to ISO-8859-1
    ic = Iconv.new('ISO-8859-1', 'UTF-8')
    feed.description = ic.iconv(feed.description)

You can also try this patch:

    cd /usr/local/lib/ruby/gems/1.8/gems/
    wget http://n0life.org/~julbouln/feedtools_encoding.patch
    patch -p1 < feedtools_encoding.patch

The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial
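Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0, so the Iconv snippet above only runs on older Rubies. On Ruby 1.9 or later, a similar conversion can be sketched with String#encode. The `convert_encoding` helper below is hypothetical (it is not part of FeedTools) and keeps the same (to, from) argument order as Iconv.new.

```ruby
# Hypothetical helper with the same argument order as Iconv.new(to, from).
# It relabels the string as `from`, then transcodes it to `to`.
def convert_encoding(text, to, from)
  text.dup.force_encoding(from).encode(to)
end

utf8   = "caf\u00e9"
latin1 = convert_encoding(utf8, 'ISO-8859-1', 'UTF-8')

p latin1.bytes                                             # [99, 97, 102, 233]
p convert_encoding(latin1, 'UTF-8', 'ISO-8859-1') == utf8  # true
```

Which direction you need depends on what the feed text actually looks like, so inspect the bytes of a broken string before picking one.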

Time estimation

By default, FeedTools tries to estimate when a feed item was published if the feed doesn’t provide a date. This annoys me and produces weird publish dates, so it’s usually a good idea to disable it with the timestamp_estimation_enabled option:

    FeedTools.reset_configurations
    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = nil
    # 15.minutes comes from ActiveSupport; it equals 900 seconds
    FeedTools.configurations[:default_ttl] = 15.minutes
    FeedTools.configurations[:timestamp_estimation_enabled] = false

Configuration options

To see a list of the available configuration options, run the following code:

    require 'pp'
    pp FeedTools.configurations

Tagged feedtools, rss, atom, parser, ruby, content encoding, utf-8, iso-8859-1