How to parse an RSS or Atom feed with Python and the Universal Feed Parser library

Python posted 2 months ago by christian

This example uses the Universal Feed Parser, one of the best and fastest parsers for Python.

According to my simple benchmark, Feed Parser is a lot faster than FeedTools for Ruby and about as fast as the ROME Java library.

Feed Parser uses less memory and about as much CPU as ROME, but this wasn’t tested with a long-running process, so don’t take my word for it.

    import time
    import feedparser

    start = time.time()

    feeds = [
        'http://..',
        'http://'
    ]

    # Request options; replace the '..' placeholders with real values.
    options = {
        'agent': '..',
        'etag': '..',
        # Recent feedparser releases accept the HTTP date string directly.
        'modified': 'Sat, 29 Oct 1994 19:43:31 GMT',
        'referrer': '..'
    }

    for url in feeds:
        feed = feedparser.parse(url, **options)

        print(len(feed.entries))
        # The title can be missing, e.g. on a 304 Not Modified response.
        print(feed.feed.get('title', ''))

    end = time.time()

    print('fetch took %0.3f s' % (end - start))
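
The etag and modified options are what make a conditional GET possible: the server can answer 304 Not Modified instead of resending the whole feed. As a rough sketch of what those two values turn into on the wire, here is a helper that uses only the standard library. The name conditional_headers and its arguments are made up for illustration; Feed Parser builds these headers itself when you pass etag and modified.

```python
from email.utils import formatdate


def conditional_headers(etag=None, modified=None):
    """Return the request headers for a conditional GET.

    etag:     the ETag string saved from the previous response, if any
    modified: Unix timestamp of the previous Last-Modified value, if any
    (Both parameter names are assumptions made for this sketch.)
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if modified is not None:
        # HTTP dates are always expressed in GMT.
        headers["If-Modified-Since"] = formatdate(modified, usegmt=True)
    return headers


# 783459811 is the Unix timestamp for Sat, 29 Oct 1994 19:43:31 GMT.
print(conditional_headers(etag='"abc"', modified=783459811))
```

Save the ETag and Last-Modified values from each response and pass them back on the next fetch, so that unchanged feeds cost almost nothing to poll.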

Tagged universal, feed, parser, atom, rss, python

How to parse an RSS or Atom feed with the ROME Java library

Java posted 2 months ago by christian

This is a simple example of how to use the ROME library to parse feeds:

    import com.sun.syndication.feed.synd.SyndFeed;
    import com.sun.syndication.io.SyndFeedInput;
    import com.sun.syndication.io.XmlReader;
    import java.net.URL;
    import java.util.List;

    public class RomeParserTest {

        public static void main(String[] args) {
            try {
                SyndFeedInput sfi = new SyndFeedInput();

                // Replace the placeholders with the feed URLs to fetch.
                String[] urls = {
                    "...",
                    "..."
                };

                for (String url : urls) {
                    SyndFeed feed = sfi.build(new XmlReader(new URL(url)));

                    List<?> entries = feed.getEntries();

                    System.out.println(feed.getTitle());
                    System.out.println(entries.size());
                }
            } catch (Exception ex) {
                throw new RuntimeException(ex);
            }
        }
    }

Tagged rome, java, atom, rss, feed, parse

How to use Ruby and SimpleRSS to parse RSS and Atom feeds

Ruby posted 5 months ago by christian

This script is an example of how to use the SimpleRSS gem to parse an RSS feed.

The script can easily be modified to support conditional GETs. It also detects the feed’s character encoding and converts the feed to UTF-8.

    require 'iconv'
    require 'net/http'
    require 'net/https'
    require 'rubygems'
    require 'simple-rss'

    url = URI.parse('http://hbl.fi/rss.xml')

    http = Net::HTTP.new(url.host, url.port)

    http.open_timeout = http.read_timeout = 10  # Set open and read timeout to 10 seconds
    http.use_ssl = (url.scheme == "https")

    headers = {
      'User-Agent'          => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
      'If-Modified-Since'   => 'store in a database and set on each request',
      'If-None-Match'       => 'store in a database and set on each request'
    }

    response = http.get(url.path, headers)
    body = response.body

    # Look for the encoding in the XML declaration first
    encoding = body.scan(
      /^<\?xml [^>]*encoding="([^\"]*)"[^>]*\?>/
    ).flatten.first

    # scan returns no match as nil, so check for nil before empty?
    if encoding.nil? || encoding.empty?
      if response["Content-Type"] =~ /charset=([\w\d-]+)/
        encoding = $1.downcase
        puts "Feed #{url} is #{encoding} according to Content-Type header"
      else
        puts "Unable to detect content encoding for #{url}, using default."
        encoding = "ISO-8859-1"
      end
    else
      puts "Feed #{url} is #{encoding} according to XML"
    end

    # Use 'UTF-8//IGNORE', if this throws an exception
    ic = Iconv.new('UTF-8', encoding)
    body = ic.iconv(body)

    feed = SimpleRSS.parse(body)

    feed.items.each do |item|
      puts item.title
    end
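
For comparison, the same detection order (XML declaration first, then the Content-Type header, then a default) can be sketched in Python with only the standard library. The names detect_encoding and to_text are hypothetical, and the regular expressions are simplified, so treat this as a sketch and not a complete encoding sniffer.

```python
import re

# Matches the encoding attribute of an XML declaration at the start of the body.
XML_DECL = re.compile(rb'^\s*<\?xml[^>]*encoding=["\']([^"\']+)')
# Matches the charset parameter of a Content-Type header value.
CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_encoding(body, content_type=None, default="ISO-8859-1"):
    """Return the lowercase encoding name for a feed given as raw bytes."""
    match = XML_DECL.search(body)
    if match:
        return match.group(1).decode("ascii").lower()
    if content_type:
        match = CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    return default.lower()


def to_text(body, content_type=None):
    """Decode the raw feed bytes, replacing bytes that cannot be decoded."""
    return body.decode(detect_encoding(body, content_type), errors="replace")


sample = b'<?xml version="1.0" encoding="ISO-8859-1"?><title>\xe4</title>'
print(detect_encoding(sample))
print(to_text(sample))
```

The errors="replace" argument plays a role similar to the 'UTF-8//IGNORE' hint in the Ruby script: undecodable bytes no longer abort the conversion.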

Tagged rss, atom, parse, ruby, simplerss, encoding, utf-8

Parsing feeds with Ruby and the FeedTools gem

Ruby posted 5 months ago by christian

This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and other feed formats.

The only negative thing about FeedTools is that the project is abandoned. The author said so in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”

Installing

    $ sudo gem install feedtools

Fetching and parsing a feed

Easy…

    require 'rubygems'
    require 'feed_tools'
    feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

    puts feed.title
    puts feed.link
    puts feed.description

    feed.items.each do |item|
      puts item.title
      puts item.link
      puts item.content
    end

Feed autodiscovery

Give FeedTools the URL of an HTML page and it finds the feed for you, here the Slashdot feed.

    puts FeedTools::Feed.open('http://www.slashdot.org').href
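
Autodiscovery usually relies on the HTML page advertising its feeds with link tags of the form rel="alternate" in the head. How FeedTools implements this isn’t shown here, so the following Python sketch only illustrates that convention; FeedLinkFinder and discover_feeds are made-up names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a <link> tag as a feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects the feed URLs advertised by <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return the feed URLs found in an HTML document, in document order."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


page = ('<html><head>'
        '<link rel="alternate" type="application/rss+xml" href="/index.rss">'
        '<link rel="stylesheet" href="/style.css">'
        '</head></html>')
print(discover_feeds(page, "http://www.slashdot.org"))
```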

Helpers

FeedTools can also clean up your dirty XML/HTML:

    require 'feed_tools'
    require 'feed_tools/helpers/feed_tools_helper'

    # html is a string containing the markup to clean up
    FeedTools::HtmlHelper.tidy_html(html)

Database cache

FeedTools can also store the fetched feeds for you:

    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"

The schema contains all you need:

    -- Example MySQL schema
    CREATE TABLE `cached_feeds` (
      `id`              int(10) unsigned NOT NULL auto_increment,
      `href`            varchar(255) default NULL,
      `title`           varchar(255) default NULL,
      `link`            varchar(255) default NULL,
      `feed_data`       longtext default NULL,
      `feed_data_type`  varchar(20) default NULL,
      `http_headers`    text default NULL,
      `last_retrieved`  datetime default NULL,
      `time_to_live`    int(10) unsigned NULL,
      `serialized`      longtext default NULL,
      PRIMARY KEY  (`id`)
    );

There’s even a Rails migration file included.

Feed updater

There’s also a feed updater tool that can fetch feeds in the background, but I haven’t had time to look at it yet.

    $ sudo gem install feedupdater

Character set/encoding bug

As always, there are bugs you need to be aware of, and FeedTools is no different. There’s an encoding bug: FeedTools encodes everything as ISO-8859-1 instead of UTF-8, which should be the default encoding.

To fix it, use the following code:

    require 'iconv'

    # Iconv.new(to, from): converts from UTF-8 to ISO-8859-1
    ic = Iconv.new('ISO-8859-1', 'UTF-8')
    feed.description = ic.iconv(feed.description)
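
One possible reading of the Iconv call above, and this is an assumption that the source doesn’t confirm, is that the text was double-encoded: UTF-8 bytes were treated as ISO-8859-1 and then encoded to UTF-8 a second time. Converting back to ISO-8859-1 recovers the original UTF-8 bytes. A minimal Python sketch of that repair, with a hypothetical function name:

```python
def undo_double_encoding(text):
    """Reverse UTF-8 bytes that were mistakenly decoded as ISO-8859-1.

    Assumes the double-encoding scenario described above; text that does
    not fit that pattern is returned unchanged.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded (or not cleanly), so leave it alone.
        return text


# "\u00c3\u00a4" is how a double-encoded "\u00e4" (a with diaeresis) shows up.
print(undo_double_encoding("\u00c3\u00a4"))
```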

You can also try this patch.

    $ cd /usr/local/lib/ruby/gems/1.8/gems/
    $ wget http://n0life.org/~julbouln/feedtools_encoding.patch
    $ patch -p1 < feedtools_encoding.patch

The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial

Time estimation

By default, FeedTools tries to estimate when a feed item was published if the feed doesn’t provide a date. This annoys me because it creates weird publish dates, so it’s usually a good idea to disable it with the timestamp_estimation_enabled option:

    FeedTools.reset_configurations
    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = nil
    # 15.minutes needs ActiveSupport (available in Rails)
    FeedTools.configurations[:default_ttl] = 15.minutes
    FeedTools.configurations[:timestamp_estimation_enabled] = false

Configuration options

To see a list of available configuration options, run the following code:

    require 'pp'
    pp FeedTools.configurations

Tagged feedtools, rss, atom, parser, ruby, content encoding, utf-8, iso-8859-1

Valid RSS 2.0 Feed Template for Rails

HTML (Rails) posted 10 months ago by christian

Note that because of a bug the pubDate and lastBuildDate tags are displayed in lowercase on this site…

    <?xml version="1.0"?>
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
      <channel>
        <atom:link href="http://xxxxxxx" rel="self" type="application/rss+xml" />
        <title>Code Snippets - Aktagon</title>
        <link>http://snippets.aktagon.com/</link>
        <description>Share your code with the world. Allow others to review and comment.</description>
        <language>en-us</language>
        <pubDate><%= @snippets[0].created_at.rfc822 %></pubDate>
        <lastBuildDate><%= @snippets[0].created_at.rfc822 %></lastBuildDate>
        <docs>http://blogs.law.harvard.edu/tech/rss</docs>
        <generator>Aktagon Snippets</generator>
    <% for snippet in @snippets %>
        <item>
          <title><![CDATA[<%= snippet.title %>]]></title>
          <link><%= snippet_url(snippet) %></link>
          <description><![CDATA[<%= snippet.rendered_body %>]]></description>
          <pubDate><%= snippet.created_at.rfc822 %></pubDate>
          <guid><%= snippet_url(snippet) %></guid>
          <% for tag in snippet.tags %>
            <category domain="http://snippets.aktagon.com/snippets"><![CDATA[<%= tag.name %>]]></category>
          <% end %>
        </item>
    <% end %>
      </channel>
    </rss>

Well, it’s supposed to be valid, but the syntax highlighting seems to process the link tag, so it’s not…
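
For readers outside Rails, roughly the same channel and item structure can be produced with Python’s standard library. This is a minimal sketch: build_rss and the item dictionary keys are assumptions for illustration, and it leaves out the atom:link, category, and other optional elements from the template above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Build a small RSS 2.0 document and return it as a string.

    items is a list of dicts with 'title', 'link', and 'created_at'
    (a timezone-aware datetime); these keys are made up for this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        ET.SubElement(item, "guid").text = entry["link"]
        # Each item carries its own date, formatted as an RFC 822 style date.
        ET.SubElement(item, "pubDate").text = format_datetime(entry["created_at"])
    return ET.tostring(rss, encoding="unicode")


print(build_rss(
    "Code Snippets", "http://snippets.aktagon.com/", "Share your code.",
    [{"title": "Example", "link": "http://snippets.aktagon.com/snippets/1",
      "created_at": datetime(1994, 10, 29, 19, 43, 31, tzinfo=timezone.utc)}],
))
```

Building the tree with ElementTree also takes care of escaping special characters in titles and links, which the template handles with CDATA sections.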

Tagged ruby, rails, rss2.0, feed, rss, template, example