iso-8859-1 snippets

Parsing feeds with Ruby and the FeedTools gem

Tagged feedtools, rss, atom, parser, ruby, content encoding, utf-8, iso-8859-1  Languages ruby

This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports atom, rss, and so on...

The only negative thing about FeedTools is that the project is abandoned, the author said this in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”

Installing

$ sudo gem install feedtools

Fetching and parsing a feed

Easy...

require 'rubygems'
require 'feed_tools'
feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

puts feed.title
puts feed.link
puts feed.description

for item in feed.items
  puts item.title
  puts item.link
  puts item.content
end

Feed autodiscovery

FeedTools finds the Slashdot feed for you.

puts FeedTools::Feed.open('http://www.slashdot.org').href

Helpers

FeedTools can also cleanup your dirty XML/HTML:

require 'feed_tools'
require 'feed_tools/helpers/feed_tools_helper'

FeedTools::HtmlHelper.tidy_html(html)

Database cache

FeedTools can also store the fetched feeds for you:

FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"

The schema contains all you need:

-- Example MySQL schema
  CREATE TABLE cached_feeds (
    id              int(10) unsigned NOT NULL auto_increment,
    href            varchar(255) default NULL,
    title           varchar(255) default NULL,
    link            varchar(255) default NULL,
    feed_data       longtext default NULL,
    feed_data_type  varchar(20) default NULL,
    http_headers    text default NULL,
    last_retrieved  datetime default NULL,
    time_to_live    int(10) unsigned NULL,
    serialized       longtext default NULL,
    PRIMARY KEY  (id)
  )

There's even a Rails migration file included.

Feed updater

There's also a feed updater tool that can fetch feeds in the background, but I haven't had time to look at it yet.

sudo gem install feedupdater

Character set/encoding bug

As always, there are bugs that you need to be aware of, Feedtools is no different. There's an encoding bug, FeedTools encodes everything to ISO-8859-1, instead UTF-8 which should be the default encoding.

To fix it use the following code:

ic = Iconv.new('ISO-8859-1', 'UTF-8')
feed.description = ic.iconv(feed.description)

You can also try this patch.

cd /usr/local/lib/ruby/gems/1.8/gems/
wget http://n0life.org/~julbouln/feedtools_encoding.patch
patch -p1 feedtools_encoding.patch

The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial

Time estimation

By default FeedTools will try to estimate when a feed item was published, if it's not available from the feed. This annoys me and will create weird publish dates, so usually it's a good idea to disable it with the timestamp_estimation_enabled option:

FeedTools.reset_configurations
FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = nil
FeedTools.configurations[:default_ttl]   = 15.minutes
FeedTools.configurations[:timestamp_estimation_enabled] = false

Configuration options

To see a list of available configuration options run the following code:

pp FeedTools.configurations

How to fix Internet Explorer and Firefox form encoding issues when posting data to a server having a different encoding

Tagged utf-8, iso-8859-1, internet explorer, firefox, encoding, form  Languages html

This snippet explains how to fix Internet Explorer and Firefox form encoding issues when posting data to a server having a different encoding than the source system.

This happens, for example, when you host a form on a server using ISO-8859-1 that posts data to a server using UTF-8.

The fix for Firefox (and Opera and other sensible browsers) is to use the accept-charset attribute:

<form ...  accept-charset="utf-8">

The fix for Internet Explorer is to use a hack:

<form ...  accept-charset="utf-8">
  <input type="hidden" name="enc" value="&#153;">
</form>

The hidden input field will make Internet Explorer understand that you want it to support UTF-8.