Parsing feeds with Ruby and the FeedTools gem
This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and other feed formats.
The only drawback of FeedTools is that the project is abandoned. The author said so in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”
Installing
    $ sudo gem install feedtools
Fetching and parsing a feed
Easy…
    require 'rubygems'
    require 'feed_tools'
    feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

    puts feed.title
    puts feed.link
    puts feed.description

    for item in feed.items
      puts item.title
      puts item.link
      puts item.content
    end
Feed autodiscovery
Give FeedTools the URL of the site instead of the feed, and it finds the Slashdot feed for you.
    puts FeedTools::Feed.open('http://www.slashdot.org').href
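To give an idea of what autodiscovery involves, here is a minimal sketch in plain Ruby. It assumes the usual convention where a page advertises its feeds with link tags whose rel is "alternate" and whose type is an RSS or Atom MIME type. The helper discover_feed_urls, the FEED_TYPES constant, and the example.com markup are made up for illustration. This is not FeedTools’ own code, which may well work differently.

```ruby
# Sketch of feed autodiscovery using only core Ruby (no FeedTools).
# discover_feed_urls and FEED_TYPES are hypothetical names, not FeedTools API.

FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Returns the href of every <link rel="alternate"> tag that advertises a feed.
def discover_feed_urls(html)
  html.scan(/<link\b[^>]*>/i).each_with_object([]) do |tag, urls|
    # Collect the tag's quoted attributes into a hash with lowercase keys.
    attrs = {}
    tag.scan(/([\w\-]+)\s*=\s*["']([^"']*)["']/) do |name, value|
      attrs[name.downcase] = value
    end

    # Skip stylesheets, icons, and anything that is not an RSS/Atom alternate.
    next unless attrs['rel'].to_s.downcase.split.include?('alternate')
    next unless FEED_TYPES.include?(attrs['type'].to_s.downcase)

    urls << attrs['href'] if attrs['href']
  end
end

# Made-up page used only to exercise the helper.
sample_html = <<~HTML
  <html><head>
    <title>Example site</title>
    <link rel="stylesheet" type="text/css" href="/style.css">
    <link rel="alternate" type="application/rss+xml" title="News" href="http://example.com/index.rss">
  </head><body></body></html>
HTML

puts discover_feed_urls(sample_html)
```

A regular expression is good enough for a sketch like this, but it only handles quoted attributes, so a real HTML parser would be the safer choice for messy pages.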
Helpers
FeedTools can also clean up your dirty XML/HTML:
    require 'feed_tools'
    require 'feed_tools/helpers/feed_tools_helper'

    FeedTools::HtmlHelper.tidy_html(html)
Database cache
FeedTools can also store the fetched feeds for you:
    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"
The schema contains all you need:
    -- Example MySQL schema
    CREATE TABLE `cached_feeds` (
      `id` int(10) unsigned NOT NULL auto_increment,
      `href` varchar(255) default NULL,
      `title` varchar(255) default NULL,
      `link` varchar(255) default NULL,
      `feed_data` longtext default NULL,
      `feed_data_type` varchar(20) default NULL,
      `http_headers` text default NULL,
      `last_retrieved` datetime default NULL,
      `time_to_live` int(10) unsigned NULL,
      `serialized` longtext default NULL,
      PRIMARY KEY (`id`)
    )
There’s even a Rails migration file included.
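The last_retrieved and time_to_live columns suggest how a cache like this can decide when to fetch a feed again. The sketch below is my own reading of the schema, not FeedTools code. It assumes time_to_live is stored in seconds and last_retrieved is a Time. CachedFeedRow and cache_expired? are hypothetical names.

```ruby
# Sketch of a freshness check for a row from the cached_feeds table.
# Assumptions (not taken from FeedTools): time_to_live is in seconds and
# last_retrieved is a Time. CachedFeedRow and cache_expired? are hypothetical.

CachedFeedRow = Struct.new(:href, :last_retrieved, :time_to_live)

def cache_expired?(row, now = Time.now)
  # A row that was never fetched, or has no TTL, is treated as expired.
  return true if row.last_retrieved.nil? || row.time_to_live.nil?

  # Expired once the TTL has fully elapsed since the last retrieval.
  now >= row.last_retrieved + row.time_to_live
end

now   = Time.utc(2008, 3, 1, 12, 0, 0)
fresh = CachedFeedRow.new('http://www.slashdot.org/index.rss', now - 300, 900)
stale = CachedFeedRow.new('http://www.slashdot.org/index.rss', now - 3600, 900)

puts cache_expired?(fresh, now) # fetched 5 minutes ago, TTL 15 minutes
puts cache_expired?(stale, now) # fetched 1 hour ago, TTL 15 minutes
```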
Feed updater
There’s also a feed updater tool that can fetch feeds in the background, but I haven’t had time to look at it yet.
    $ sudo gem install feedupdater
Character set/encoding bug
As always, there are bugs you need to be aware of, and FeedTools is no different. There’s an encoding bug: FeedTools encodes everything as ISO-8859-1 instead of UTF-8, which should be the default encoding.
To fix it use the following code:
    require 'iconv'

    # Iconv.new takes the target encoding first, then the source encoding.
    ic = Iconv.new('ISO-8859-1', 'UTF-8')
    feed.description = ic.iconv(feed.description)
You can also try this patch.
    cd /usr/local/lib/ruby/gems/1.8/gems/
    wget http://n0life.org/~julbouln/feedtools_encoding.patch
    patch -p1 < feedtools_encoding.patch
The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial
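Iconv was removed from the Ruby standard library in Ruby 2.0, so on newer Rubies the same kind of conversion has to be done with String#encode. The sketch below assumes the garbled text is double-encoded, meaning UTF-8 bytes that were misread as ISO-8859-1 and then converted to UTF-8 a second time. That is my guess at the symptom, not something I have verified against FeedTools. repair_double_encoding is a hypothetical helper.

```ruby
# Sketch: repairing double-encoded text with String#encode (Ruby 1.9+).
# Assumption: the text was valid UTF-8 whose bytes were misread as
# ISO-8859-1 and converted to UTF-8 again.
# repair_double_encoding is a hypothetical helper, not part of FeedTools.

def repair_double_encoding(text)
  # Map each character back to its single ISO-8859-1 byte, then
  # reinterpret those bytes as UTF-8.
  repaired = text.encode('ISO-8859-1').force_encoding('UTF-8')

  # If the result is not valid UTF-8, the input was not double-encoded.
  repaired.valid_encoding? ? repaired : text
rescue EncodingError
  # Characters outside ISO-8859-1 also mean there is nothing to repair.
  text
end

original = "caf\u00e9"
# Simulate the bug: treat the UTF-8 bytes as ISO-8859-1 and convert again.
mangled = original.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts mangled
puts repair_double_encoding(mangled)
```

The helper returns the input unchanged when the repair would produce invalid UTF-8, so text that was never double-encoded is not damaged by it.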
Time estimation
By default, FeedTools tries to estimate when a feed item was published if the feed doesn’t provide a date. This annoys me and produces weird publish dates, so it’s usually a good idea to disable it with the timestamp_estimation_enabled option:
    FeedTools.reset_configurations
    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = nil
    FeedTools.configurations[:default_ttl] = 15.minutes
    FeedTools.configurations[:timestamp_estimation_enabled] = false
Configuration options
To see a list of available configuration options run the following code:
    require 'pp'
    pp FeedTools.configurations