parse snippets

How to use Ruby and SimpleRSS to parse RSS and Atom feeds

Tagged rss, atom, parse, ruby, simplerss, encoding, utf-8  Languages ruby

This script is an example of how to use the SimpleRSS gem to parse an RSS feed.

The script can easily be modified to support conditional gets. It also detects the feed's character encoding and converts the feed to UTF-8.
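
Before the full script, here is a minimal sketch of what conditional GET support could look like. The helper names `conditional_headers` and `fetch_if_changed` are hypothetical, and the sketch assumes the ETag and Last-Modified values from the previous response have been saved somewhere (for example in a database).

```ruby
require 'net/http'

# Hypothetical helper: build the conditional request headers from the
# validators saved after the previous fetch. Nil values are skipped.
def conditional_headers(etag, last_modified)
  headers = {}
  headers['If-None-Match'] = etag if etag
  headers['If-Modified-Since'] = last_modified if last_modified
  headers
end

# Hypothetical helper: returns [body, etag, last_modified].
# body is nil when the server answers 304 Not Modified, in which case
# the previously stored validators are returned unchanged.
def fetch_if_changed(http, path, etag, last_modified)
  response = http.get(path, conditional_headers(etag, last_modified))
  return [nil, etag, last_modified] if response.is_a?(Net::HTTPNotModified)

  [response.body, response['ETag'], response['Last-Modified']]
end
```

The returned ETag and Last-Modified values would be stored and passed back in on the next request. A nil body means the feed has not changed and parsing can be skipped.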

require 'iconv'
require 'net/http'
require 'net/https'
require 'rubygems'
require 'simple-rss'

url = URI.parse('http://hbl.fi/rss.xml')

http = Net::HTTP.new(url.host, url.port)

http.open_timeout = http.read_timeout = 10  # Set open and read timeout to 10 seconds
http.use_ssl = (url.scheme == "https")

headers = {
  'User-Agent'          => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
  # For conditional GETs, store the Last-Modified and ETag values of each
  # response (for example in a database) and send them with the next request:
  # 'If-Modified-Since' => last_modified,
  # 'If-None-Match'     => etag
}

# Since Ruby 1.9, http.get returns only the response object
response = http.get(url.request_uri, headers)
body = response.body

encoding = body.scan(
/^<\?xml [^>]*encoding="([^\"]*)"[^>]*\?>/
).flatten.first

if encoding.nil? || encoding.empty?
    if response["Content-Type"] =~ /charset=([\w\d-]+)/
        encoding = $1.downcase
        puts "Feed #{url} is #{encoding} according to Content-Type header"
    else
        puts "Unable to detect content encoding for #{url}, using default."
        encoding = "ISO-8859-1"
    end
else
    puts "Feed #{url} is #{encoding} according to XML"
end

# Iconv was removed from the standard library in Ruby 2.0; newer Rubies need the iconv gem
# Use 'UTF-8//IGNORE', if this throws an exception
ic = Iconv.new('UTF-8', encoding)
body = ic.iconv(body)

feed = SimpleRSS.parse(body)

feed.items.each do |item|
  puts item.title
end
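
On Rubies without Iconv, the same detect-and-convert step can be sketched with String#encode. This is a minimal sketch, not a drop-in replacement: the helper name `feed_to_utf8` is hypothetical, and it assumes the declared encoding name is one that Ruby knows (unknown names fall back to the default).

```ruby
# Hypothetical helper: find the declared encoding (XML declaration first,
# then the Content-Type header, then a default) and return the body as UTF-8.
def feed_to_utf8(body, content_type = nil, default = 'ISO-8859-1')
  raw = body.b  # binary copy, so the regex never trips over invalid bytes

  declared = raw[/\A\s*<\?xml[^>]*encoding=["']([^"']+)["']/i, 1]
  declared ||= content_type.to_s[/charset=["']?([\w-]+)/i, 1]

  encoding = begin
    Encoding.find(declared || default)
  rescue ArgumentError
    Encoding.find(default)  # unknown encoding name: fall back
  end

  # Label the bytes with the detected encoding, then transcode to UTF-8.
  # Bytes that cannot be converted are replaced instead of raising.
  raw.force_encoding(encoding).encode('UTF-8', invalid: :replace, undef: :replace)
end
```

With this helper, the Iconv lines could be replaced by something like `body = feed_to_utf8(response.body, response['Content-Type'])` before handing the body to SimpleRSS.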

How to parse RSS/Atom feeds with the ROME Java library

Tagged rome, java, atom, rss, feed, parse  Languages java

This is a simple example of how to use the ROME library to parse feeds:

import com.sun.syndication.io.*;
import com.sun.syndication.feed.synd.*;
import java.net.URL;
import java.util.*;

public class RomeParserTest {

    public static void main(String args[]) {
        try {
            SyndFeedInput sfi = new SyndFeedInput();

            String urls[] = {
                "...", 
                "..." 
            };
            
            for(String url:urls) {
                SyndFeed feed = sfi.build(new XmlReader(new URL(url)));

                List entries = feed.getEntries();

                System.out.println(feed.getTitle());            
                System.out.println(entries.size());
            }
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }
}

How to parse CSV data with Ruby

Tagged csv, parse, ruby, fastercsv, ccsv, csvscan, excelsior  Languages ruby

Ruby alternatives for parsing CSV files

  • Ruby String#split (slow)
  • Built-in CSV (ok, recommended)
  • ccsv (fast & recommended if you have control over CSV format)
  • CSVScan (fast & recommended if you have control over CSV format)
  • Excelsior (fast & recommended if you have control over CSV format)

CSV library benchmarks can be found here and here

Parsing with plain Ruby

filename = 'data.csv'

# The block form closes the file automatically
File.open(filename, 'r') do |file|
  file.each_line("\n") do |row|
    columns = row.chomp.split(",")

    break if file.lineno > 10  # Stop once past line 10
  end
end

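This option has several problems: a plain split breaks on quoted fields that contain the separator, on escaped quotes, and on line breaks inside fields.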
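
To illustrate, here is a small comparison of String#split and the built-in CSV parser on a line with a quoted field. The sample line is made up for this example.

```ruby
require 'csv'

# A made-up line: the second field contains a comma, the third escaped quotes
line = %q{1,"Doe, Jane","said ""hi"""}

# Naive split cuts the quoted field in two and keeps the quote characters
naive = line.split(',')

# The CSV parser honours the quoting rules
parsed = CSV.parse_line(line)

puts naive.inspect
puts parsed.inspect
```

The naive version yields four fragments, while CSV.parse_line returns the three fields with the quoting removed.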

Parsing with the built-in CSV library

require 'csv'

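# Ruby 1.8 only: the third argument is the column separator
CSV.open('data.csv', 'r', ';') do |row|
  puts row
end

On Ruby 1.9 and later, pass the options as keywords to CSV.foreach:

require 'csv'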

CSV.foreach("changes.csv", quote_char: '"', col_sep: ';', row_sep: :auto, headers: true) do |row|
  puts row[0]
  puts row['xxx']
end

Parsing with the ccsv library

ccsv is hosted on GitHub.

require 'rubygems'
require 'ccsv'

Ccsv.foreach('data.csv') do |values|
  puts values[0]
end

Parsing with the CSVScan library

CSVScan can be downloaded from here.

require "csvscan"

File.open("data.csv") do |io|
  CSVScan.scan(io) do |row|
    puts row
  end
end

Parsing feeds with Ruby and rFeedParser

Tagged rfeedparser, ruby, rss, parse, feed  Languages ruby

rFeedParser is a Ruby version of the feedparser Python library, which is probably the best (not fastest) feed parser.

To install it, follow the instructions on the project's GitHub page.

require 'rubygems'
require 'rfeedparser'
require 'benchmark'


seconds = Benchmark.realtime do
    body = File.read('example-feed.xml')

    500.times do
        feed = FeedParser.parse(body) # Can be URL, string, data.
    end
end

puts "#{seconds.round} seconds elapsed."

rFeedParser has one problem: in my simple test it was roughly 3-4 times slower than feed-normalizer and feedparser.org.

How to parse OPML with Ruby

Tagged opml, xml, parse, ruby  Languages ruby

This example demonstrates how to parse OPML with Ruby.

First install the gem.

gem install opml

Then run this code:

require 'pp'
require 'rubygems'
require 'opml'

opml = Opml.new(File.read('opml.xml'))
pp opml

# Attributes of the first outline
puts opml.outlines[0].attributes['xml_url']
puts opml.outlines[0].attributes['html_url']
puts opml.outlines[0].attributes['title']
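
If installing the gem is not an option, the same information can be pulled out with REXML. This is a different technique from the opml gem shown above, sketched under a few assumptions: the helper name `opml_feeds` is hypothetical, the sample document is made up, and feed outlines are assumed to carry the usual OPML attributes xmlUrl and htmlUrl.

```ruby
require 'rexml/document'

# Hypothetical helper: collect title, feed URL and site URL for every
# outline element that has an xmlUrl attribute, at any nesting depth.
def opml_feeds(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//outline[@xmlUrl]').map do |outline|
    {
      title:    outline.attributes['title'] || outline.attributes['text'],
      xml_url:  outline.attributes['xmlUrl'],
      html_url: outline.attributes['htmlUrl']
    }
  end
end

# Made-up sample document with one feed inside a folder outline
sample = <<~OPML
  <?xml version="1.0" encoding="UTF-8"?>
  <opml version="1.1">
    <body>
      <outline text="News">
        <outline title="HBL" text="HBL" type="rss" xmlUrl="http://hbl.fi/rss.xml" htmlUrl="http://hbl.fi/"/>
      </outline>
    </body>
  </opml>
OPML

feeds = opml_feeds(sample)
feeds.each { |feed| puts "#{feed[:title]}: #{feed[:xml_url]}" }
```

Folder outlines without an xmlUrl attribute are skipped, so only the actual feeds end up in the result.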

How to fix "bad URI(is not URI?)"

Tagged uri, url, ruby, parse  Languages ruby

This URL contains non-ASCII characters, which URI.parse rejects:

>> URI.parse 'http://www.yr.no/sted/Finland/Västra_Finland/Askainen/varsel.xml'
URI::InvalidURIError: bad URI(is not URI?): http://www.yr.no/sted/Finland/Västra_Finland/Askainen/varsel.xml
    from /usr/local/lib/ruby/1.8/uri/common.rb:436:in `split'
    from /usr/local/lib/ruby/1.8/uri/common.rb:485:in `parse'
    from (irb):5

Your browser can probably open the URL. To fix this error, encode the URL before handing it to the parse method (URI.encode was removed in Ruby 3.0, so this only works on older Rubies):

>> URI.parse(URI.encode('http://www.yr.no/sted/Finland/Västra_Finland/Askainen/varsel.xml'))
=> #<URI::HTTP:0x18bfb40 URL:http://www.yr.no/sted/Finland/V%C3%A4stra_Finland/Askainen/varsel.xml>
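
Newer Rubies no longer ship URI.encode, so the escaping has to be done another way. One possible sketch is to percent-encode the non-ASCII characters by hand. The helper name `escape_non_ascii` is hypothetical, the string is assumed to be UTF-8, and ASCII characters that are invalid in URLs (such as spaces) are deliberately left alone.

```ruby
require 'uri'

# Hypothetical helper: replace every non-ASCII character with the
# percent-encoded form of its UTF-8 bytes.
def escape_non_ascii(url)
  url.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

url = 'http://www.yr.no/sted/Finland/Västra_Finland/Askainen/varsel.xml'
uri = URI.parse(escape_non_ascii(url))
puts uri
```

Because only non-ASCII characters are touched, a URL that is already percent-encoded passes through unchanged.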

How to parse XML with Python's built-in ElementTree parser

Tagged elementtree, python, xml, parse  Languages python

from xml.etree.ElementTree import fromstring, tostring

namespace = 'https://xxx.com/xxx'
element = fromstring(xml)  # xml is a string holding the XML document

device = element.find('.//{%s}Device' % namespace)
detail = device.find('.//{%s}Details' % namespace)
series = device.findall('.//{%s}Series' % namespace)

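Watch out for namespaces: in a namespaced document, find and findall only match tags qualified with the namespace, as in {namespace}Tag above. A bare tag name such as Device finds nothing.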

How to parse request parameters with JavaScript

Tagged javascript, jquery, request, parameters, url, parse  Languages javascript

Code:

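var Request = {
    parameter: function(name) {
        return this.parameters()[name];
    },
    parameters: function(uri) {
        var i, parameter, params, query, result, start;
        result = {};
        if (!uri) {
            uri = window.location.search;
        }
        start = uri.indexOf("?");
        if (start === -1) {
            return {};
        }
        // Take everything after the "?", so full URLs work as well
        query = uri.slice(start + 1);
        params = query.split("&");
        for (i = 0; i < params.length; i++) {
            if (params[i] === "") {
                continue; // Skip empty pairs such as a trailing "&"
            }
            parameter = params[i].split("=");
            // Decode percent-escapes; a name without "=" maps to ""
            result[decodeURIComponent(parameter[0])] = decodeURIComponent(parameter[1] || "");
        }
        return result;
    }
};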

Examples:

// ?query=test
var query = Request.parameter('query');

var parameters = Request.parameters();
// This works too
var query = parameters.query;
// And this
var query = parameters['query'];

// Replacing a parameter is easy with jQuery
parameters = Request.parameters();
// change sort order
parameters.order = 'new-world-order';
var new_parameters = $.param(parameters);
var url = window.location.pathname + "?" + new_parameters;

How to parse RSS/Atom feeds with Scala and the Rome library

Tagged scala, feed, atom, rss, parse  Languages java

This snippet shows how to parse feeds with Scala and the Rome library:

import com.sun.syndication.io._
import com.sun.syndication.feed.synd._
import java.net.URL

object FeedParser {
  def main(args: Array[String]): Unit = {
    try {
      val sfi = new SyndFeedInput()

      val urls = List("http://hbl.fi/rss.xml")
      
      urls.foreach(url => {
        val feed = sfi.build(new XmlReader(new URL(url)))

        val entries = feed.getEntries()

        println(feed.getTitle())
        println(entries.size())
      })
    } catch {
      case e: Exception => throw new RuntimeException(e)
    }
    
  }
}

Also see: https://gist.github.com/585235/bf328d90d094305121cec0ba2a646ce0093fa654

How to parse XML feeds with jQuery

Tagged atom, rss, feed, parse, jquery, internet explorer  Languages javascript

$.ajax({
    type: 'GET',
    url: '/some/good/stuff.xml',
    dataType: 'xml',
    error: function(xhr) {
        alert('Failed to parse feed');
    },
    success: function(xml) {
        var channel = $('channel', xml).eq(0);
        var items = [];
        $('item', xml).each( function() {
            var item = {};
            item.title = $(this).find('title').eq(0).text();
            item.link = $(this).find('link').eq(0).text();
            item.description = $(this).find('description').eq(0).text();
            item.updated = $(this).find('pubDate').eq(0).text();
            item.id = $(this).find('guid').eq(0).text();
            items.push(item);
        });
        console.dir(items);
    }
});

Your friend Internet Explorer

For IE 6 and better (worse?) the server must send the feed with the right content type, so make sure the response contains this header:

Content-type: text/xml

If this header is not set, the jQuery Ajax error handler is called and the feed is not parsed.