Parsing feeds with Ruby and the FeedTools gem
This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and other feed formats.
The main drawback of FeedTools is that the project is abandoned. The author said so in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”
Installing
$ sudo gem install feedtools
Fetching and parsing a feed
Easy...
require 'rubygems'
require 'feed_tools'
feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')
puts feed.title
puts feed.link
puts feed.description
feed.items.each do |item|
  puts item.title
  puts item.link
  puts item.content
end
Feed autodiscovery
Give FeedTools the URL of an ordinary web page and it finds the feed for you. Here it locates the Slashdot feed:
puts FeedTools::Feed.open('http://www.slashdot.org').href
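Autodiscovery usually relies on the common convention of a <link rel="alternate"> tag in the page's head that points at the feed. The sketch below shows that convention in plain Ruby. It is not FeedTools' own implementation: discover_feed_urls is a hypothetical helper, the example.com page is made up, and the regex-based tag scan is a simplification that a real HTML parser would handle more robustly.

```ruby
require 'uri'

# Hypothetical helper, not part of FeedTools: returns the absolute URLs of
# feeds advertised through <link rel="alternate" ...> tags in an HTML string.
def discover_feed_urls(html, base_url)
  feed_types = ['application/rss+xml', 'application/atom+xml']

  html.scan(/<link\b[^>]*>/i).map do |tag|
    # Collect the tag's quoted attributes into a hash, e.g. {"rel" => "alternate"}.
    attrs = tag.scan(/([\w-]+)\s*=\s*["']([^"']*)["']/).to_h { |k, v| [k.downcase, v] }

    next unless attrs['rel'].to_s.downcase.split.include?('alternate')
    next unless feed_types.include?(attrs['type'].to_s.downcase)
    next if attrs['href'].to_s.empty?

    # Resolve relative hrefs against the page URL.
    URI.join(base_url, attrs['href']).to_s
  end.compact
end

# Made-up page used only for illustration.
html = <<~HTML
  <html><head>
    <link rel="stylesheet" href="/style.css">
    <link rel="alternate" type="application/rss+xml" title="Example" href="/index.rss">
  </head></html>
HTML

puts discover_feed_urls(html, 'http://www.example.com/')
# prints http://www.example.com/index.rss
```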
Helpers
FeedTools can also clean up your dirty XML/HTML:
require 'feed_tools'
require 'feed_tools/helpers/feed_tools_helper'
FeedTools::HtmlHelper.tidy_html(html)
Database cache
FeedTools can also store the fetched feeds for you:
FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"
The schema contains all you need:
-- Example MySQL schema
CREATE TABLE cached_feeds (
id int(10) unsigned NOT NULL auto_increment,
href varchar(255) default NULL,
title varchar(255) default NULL,
link varchar(255) default NULL,
feed_data longtext default NULL,
feed_data_type varchar(20) default NULL,
http_headers text default NULL,
last_retrieved datetime default NULL,
time_to_live int(10) unsigned NULL,
serialized longtext default NULL,
PRIMARY KEY (id)
);
There's even a Rails migration file included.
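The last_retrieved and time_to_live columns suggest how a cache like this can decide when to refetch a feed. The following is only a sketch of that idea, not FeedTools' actual logic: cache_expired? is a hypothetical helper, the row is a plain hash standing in for a database record, and it assumes time_to_live is stored in seconds.

```ruby
# Hypothetical helper, not FeedTools code: decides whether a cached row is stale.
# Assumes row[:last_retrieved] is a Time and row[:time_to_live] is in seconds.
def cache_expired?(row, now = Time.now)
  last_retrieved = row[:last_retrieved]
  return true if last_retrieved.nil? # never fetched, so always refetch

  ttl = row[:time_to_live] || 0 # treat a missing TTL as "always refetch"
  now >= last_retrieved + ttl
end

fetched_at = Time.utc(2008, 3, 1, 12, 0, 0)
row = {
  href: 'http://www.example.com/index.rss', # made-up feed URL
  last_retrieved: fetched_at,
  time_to_live: 15 * 60
}

puts cache_expired?(row, fetched_at + 10 * 60) # 10 minutes later: false
puts cache_expired?(row, fetched_at + 20 * 60) # 20 minutes later: true
```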
Feed updater
There's also a feed updater tool that can fetch feeds in the background, but I haven't had time to look at it yet.
sudo gem install feedupdater
Character set/encoding bug
As always, there are bugs that you need to be aware of, and FeedTools is no different. There's an encoding bug: FeedTools encodes everything as ISO-8859-1 instead of UTF-8, which should be the default encoding.
To fix it, use the following code:
require 'iconv'
# Iconv.new takes (to, from): this converts UTF-8 input to ISO-8859-1
ic = Iconv.new('ISO-8859-1', 'UTF-8')
feed.description = ic.iconv(feed.description)
You can also try this patch.
cd /usr/local/lib/ruby/gems/1.8/gems/
wget http://n0life.org/~julbouln/feedtools_encoding.patch
patch -p1 < feedtools_encoding.patch
The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial
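Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0, so on newer Rubies the same kind of conversion can be sketched with String#encode instead. The convert_encoding helper below is hypothetical (it is not part of FeedTools) and simply mirrors the argument order of Iconv.new(to, from). Which direction you need depends on what your feed actually contains, so treat both calls as illustrations.

```ruby
# Hypothetical helper, not part of FeedTools: mirrors Iconv.new(to, from).iconv(text)
# using String#encode, which is built into Ruby 1.9 and later.
def convert_encoding(text, to, from)
  # Tag the bytes with their source encoding, then transcode. Characters that
  # cannot be represented in the target encoding are replaced with '?'.
  text.dup.force_encoding(from).encode(to, invalid: :replace, undef: :replace, replace: '?')
end

# Same direction as Iconv.new('ISO-8859-1', 'UTF-8').
latin1 = convert_encoding("caf\u00E9", 'ISO-8859-1', 'UTF-8')
p latin1.bytes # => [99, 97, 102, 233]

# The opposite direction, for text that arrived as raw ISO-8859-1 bytes.
utf8 = convert_encoding("caf\xE9".b, 'UTF-8', 'ISO-8859-1')
puts utf8 # prints café
```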
Time estimation
By default, FeedTools tries to estimate when a feed item was published if the feed doesn't provide that date. This annoys me and produces weird publish dates, so it's usually a good idea to disable it with the timestamp_estimation_enabled option:
FeedTools.reset_configurations
FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = nil
# 15.minutes is an ActiveSupport extension; use 15 * 60 if ActiveSupport isn't loaded
FeedTools.configurations[:default_ttl] = 15.minutes
FeedTools.configurations[:timestamp_estimation_enabled] = false
Configuration options
To see a list of available configuration options, run the following code:
require 'pp'
pp FeedTools.configurations