How to use Ruby and SimpleRSS to parse RSS and Atom feeds
This script is an example of how to use the SimpleRSS gem to parse an RSS feed.
The script can easily be modified to support conditional GETs. It also detects the feed’s character encoding and converts the feed to UTF-8.
    require 'iconv'
    require 'net/http'
    require 'net/https'
    require 'rubygems'
    require 'simple-rss'

    url = URI.parse('http://hbl.fi/rss.xml')

    http = Net::HTTP.new(url.host, url.port)

    http.open_timeout = http.read_timeout = 10 # Set open and read timeout to 10 seconds
    http.use_ssl = (url.scheme == "https")

    # The two conditional GET values below are placeholders
    headers = {
      'User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
      'If-Modified-Since' => 'store in a database and set on each request',
      'If-None-Match' => 'store in a database and set on each request'
    }

    response = http.get(url.path, headers)
    body = response.body

    # Look for the encoding in the XML declaration first
    encoding = body.scan(
      /^<\?xml [^>]*encoding="([^\"]*)"[^>]*\?>/
    ).flatten.first

    # encoding is nil when the XML declaration has no encoding attribute
    if encoding.nil? || encoding.empty?
      if response["Content-Type"] =~ /charset=([\w\d-]+)/
        encoding = $1.downcase
        puts "Feed #{url} is #{encoding} according to Content-Type header"
      else
        puts "Unable to detect content encoding for #{url}, using default."
        encoding = "ISO-8859-1"
      end
    else
      puts "Feed #{url} is #{encoding} according to XML"
    end

    # Use 'UTF-8//IGNORE', if this throws an exception
    ic = Iconv.new('UTF-8', encoding)
    body = ic.iconv(body)

    feed = SimpleRSS.parse(body)

    for item in feed.items
      puts item.title
    end
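Since the script mentions conditional GETs, here is a minimal sketch of how that part could look. The helper names (conditional_headers, validators_from, not_modified?) are hypothetical and not part of SimpleRSS or Net::HTTP. The sketch assumes you save the ETag and Last-Modified values from each response and send them back as If-None-Match and If-Modified-Since on the next request; where you store them (database, file) is left out.

```ruby
require 'net/http'

# Builds the headers for a conditional GET from the validators saved
# after the previous fetch. Either value may be nil on the first request.
def conditional_headers(etag, last_modified)
  headers = {}
  headers['If-None-Match'] = etag if etag
  headers['If-Modified-Since'] = last_modified if last_modified
  headers
end

# Extracts the validators to store for the next request.
def validators_from(response)
  { :etag => response['ETag'], :last_modified => response['Last-Modified'] }
end

# True when the server answered 304 Not Modified.
def not_modified?(response)
  response.is_a?(Net::HTTPNotModified)
end

# Possible usage with the http and url objects from the script above
# (cache is a hypothetical hash loaded from your own storage):
#
#   response = http.get(url.path, conditional_headers(cache[:etag], cache[:last_modified]))
#   unless not_modified?(response)
#     cache = validators_from(response)
#     feed  = SimpleRSS.parse(response.body)
#   end
```

When the server answers 304 Not Modified, the response has no body, so the previously parsed feed can be reused as is.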
Parsing feeds with Ruby and the FeedTools gem
This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and so on.
The only negative thing about FeedTools is that the project is abandoned. The author said so in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”
Installing
    $ sudo gem install feedtools
Fetching and parsing a feed
Easy…
    require 'rubygems'
    require 'feed_tools'
    feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

    puts feed.title
    puts feed.link
    puts feed.description

    for item in feed.items
      puts item.title
      puts item.link
      puts item.content
    end
Feed autodiscovery
FeedTools finds the Slashdot feed for you.
    puts FeedTools::Feed.open('http://www.slashdot.org').href
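To show what autodiscovery means in general, here is a rough sketch of the usual technique: scan the page for link tags with rel="alternate" and a feed MIME type. This is not how FeedTools implements it (I haven’t checked its source); discover_feeds and FEED_TYPES are hypothetical names, and a regular expression is only a stand-in for a real HTML parser.

```ruby
require 'uri'

# MIME types that usually mark a feed in <link rel="alternate"> tags.
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Returns the absolute URLs of the feeds advertised in the given HTML.
def discover_feeds(html, base_url)
  html.scan(/<link\b[^>]*>/i).map do |tag|
    # Collect the tag's quoted attributes into a hash with lowercase keys.
    attrs = {}
    tag.scan(/([\w-]+)\s*=\s*["']([^"']*)["']/) do |name, value|
      attrs[name.downcase] = value
    end

    next unless attrs['rel'].to_s.downcase.split.include?('alternate')
    next unless FEED_TYPES.include?(attrs['type'].to_s.downcase)
    next if attrs['href'].to_s.empty?

    # Resolve relative hrefs against the page URL.
    URI.join(base_url, attrs['href']).to_s
  end.compact
end

html = '<head><link rel="alternate" type="application/rss+xml" href="/index.rss"></head>'
puts discover_feeds(html, 'http://www.example.com/')
```

The example.com URL and the HTML string are made up for the example; in practice the HTML would be the body of the page you fetched.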
Helpers
FeedTools can also clean up your dirty XML/HTML:
    require 'feed_tools'
    require 'feed_tools/helpers/feed_tools_helper'

    # html is a string containing the markup to clean up
    FeedTools::HtmlHelper.tidy_html(html)
Database cache
FeedTools can also store the fetched feeds for you:
    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"
The schema contains all you need:
    -- Example MySQL schema
    CREATE TABLE `cached_feeds` (
      `id` int(10) unsigned NOT NULL auto_increment,
      `href` varchar(255) default NULL,
      `title` varchar(255) default NULL,
      `link` varchar(255) default NULL,
      `feed_data` longtext default NULL,
      `feed_data_type` varchar(20) default NULL,
      `http_headers` text default NULL,
      `last_retrieved` datetime default NULL,
      `time_to_live` int(10) unsigned NULL,
      `serialized` longtext default NULL,
      PRIMARY KEY (`id`)
    )
There’s even a Rails migration file included.
Feed updater
There’s also a feed updater tool that can fetch feeds in the background, but I haven’t had time to look at it yet.
    sudo gem install feedupdater
Character set/encoding bug
As always, there are bugs that you need to be aware of, and FeedTools is no different. There’s an encoding bug: FeedTools encodes everything to ISO-8859-1 instead of UTF-8, which should be the default encoding.
To fix it, use the following code:
    require 'iconv'

    # Iconv.new(to, from): converts the description from UTF-8 to ISO-8859-1
    ic = Iconv.new('ISO-8859-1', 'UTF-8')
    feed.description = ic.iconv(feed.description)
You can also try this patch.
    cd /usr/local/lib/ruby/gems/1.8/gems/
    wget http://n0life.org/~julbouln/feedtools_encoding.patch
    patch -p1 < feedtools_encoding.patch
The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial
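The Iconv calls above only work on older Rubies: Iconv was removed from the standard library in Ruby 2.0. On Ruby 1.9 and later, String#encode does the same job. Here is a minimal sketch of an equivalent helper; the name convert and its ignore_errors flag are my own invention, with the arguments in the same (to, from) order as Iconv.new.

```ruby
# Converts text from one encoding to another, like Iconv.new(to, from).iconv(text).
# With ignore_errors set, characters that cannot be converted are dropped,
# which is roughly what the '//IGNORE' suffix does in Iconv.
def convert(text, to, from, ignore_errors = false)
  if ignore_errors
    text.encode(to, from, :invalid => :replace, :undef => :replace, :replace => '')
  else
    text.encode(to, from)
  end
end

latin1 = "\xE4".b                       # the byte for a-umlaut in ISO-8859-1
utf8   = convert(latin1, 'UTF-8', 'ISO-8859-1')
puts utf8.bytes.inspect                 # the same character as UTF-8 bytes
```

Without ignore_errors, String#encode raises an exception when it hits a character that has no equivalent in the target encoding, much like Iconv does.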
Time estimation
By default, FeedTools tries to estimate when a feed item was published if the date isn’t available from the feed. This annoys me and creates weird publish dates, so it’s usually a good idea to disable it with the timestamp_estimation_enabled option:
    FeedTools.reset_configurations
    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = nil
    FeedTools.configurations[:default_ttl] = 15.minutes # 15.minutes comes from ActiveSupport
    FeedTools.configurations[:timestamp_estimation_enabled] = false
Configuration options
To see a list of available configuration options, run the following code:
    require 'pp'

    pp FeedTools.configurations
How to read UTF-8 data from an Oracle BLOB column with Java and JDBC
This example works with Oracle:
    // Needs java.io.InputStreamReader, java.io.IOException, java.io.Reader,
    // java.sql.Blob and java.sql.SQLException
    private String getBlobAsString(Blob blob)
    {
        StringBuffer result = new StringBuffer();

        if ( blob != null )
        {
            int read = 0;
            Reader reader = null;
            char[] buffer = new char[1024];

            try
            {
                // Decode the BLOB's bytes as UTF-8 while reading
                reader = new InputStreamReader(blob.getBinaryStream(), "UTF-8");

                while((read = reader.read(buffer)) != -1)
                {
                    result.append(buffer, 0, read);
                }
            }
            catch(SQLException ex)
            {
                throw new RuntimeException("Unable to read blob data.", ex);
            }
            catch(IOException ex)
            {
                throw new RuntimeException("Unable to read blob data.", ex);
            }
            finally
            {
                try { if(reader != null) reader.close(); } catch(Exception ex) {}
            }
        }

        return result.toString();
    }
Then use the method like this:
    // resultSet is your JDBC result set, positioned on a row
    String utf8 = getBlobAsString(resultSet.getBlob("xml"));
How to write UTF-8 data to an Oracle BLOB column with Java and JDBC
This example works with Oracle:
    // BLOB is oracle.sql.BLOB from the Oracle JDBC driver
    private Blob getBlob(Connection connection, String data)
    {
        BLOB blob = null;

        try
        {
            // createTemporary throws SQLException, so it has to be inside the try block
            blob = BLOB.createTemporary(connection, true, BLOB.DURATION_SESSION);
            blob.open(BLOB.MODE_READWRITE);
            // Consider streaming, if data size is unknown. Note that setBytes doesn't work
            blob.putBytes(1, data.getBytes("UTF-8"));
        }
        catch(UnsupportedEncodingException ex)
        {
            throw new RuntimeException("Unable to get a blob for '" + data + "'", ex);
        }
        catch(SQLException ex)
        {
            throw new RuntimeException("Unable to get a blob for '" + data + "'", ex);
        }
        finally
        {
            try { if(blob != null) blob.close(); } catch(Exception ex) {}
        }

        return blob;
    }
Then use the method like this:
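    Connection connection = getConnection();
    // sql is your INSERT or UPDATE statement with a placeholder for the BLOB column
    PreparedStatement statement = getPreparedStatement(sql);

    // text is the string to store, for example Mao's Little Red Book
    statement.setBlob(1, getBlob(connection, text));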