parser snippets

A simple, easy-to-use PHP XML parser

Tagged php, xml, parser, simple  Languages php

The PHP XML parser:

class XML
{
    static function parse($data, $handler, $encoding = "UTF-8")
    {
        $parser = xml_parser_create($encoding);

        xml_set_object($parser, $handler);

        xml_set_element_handler($parser,
            array($handler, 'start'),
            array($handler, 'end')
        );

        xml_set_character_data_handler(
            $parser,
            array($handler, 'content')
        );

        // Pass true for is_final since $data is the complete document
        $result = xml_parse($parser, $data, true);

        if (!$result)
        {
            $error_string = xml_error_string(xml_get_error_code($parser));
            $error_line   = xml_get_current_line_number($parser);
            $error_column = xml_get_current_column_number($parser);

            $message = sprintf("XML error '%s' at line %d column %d",
                $error_string, $error_line, $error_column);

            // Free the parser before throwing so it doesn't leak
            xml_parser_free($parser);

            throw new Exception($message);
        }

        xml_parser_free($parser);
    }
}

A result handler:

class ResultHandler
{
    var $tag;

    function start ($parser, $tagName, $attributes = null)
    {
        echo "start";
        $this->tag .= $tagName; # Append, see the note on entities below
    }

    function end ($parser, $tagName)
    {
        echo "end";
        $this->tag = null;
    }

    function content ($parser, $content)
    {
        echo "$this->tag: $content";
    }
}

Then in your code:

$xml = "<a>bah</a>";
XML::parse($xml, new ResultHandler());
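
Running this prints startA: bahend on a single line, since none of the handlers echo a newline. The tag name arrives as "A" rather than "a" because PHP's parser enables case folding by default; you can switch it off inside XML::parse with:

xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);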

Note that PHP's XML parser splits character data around entities, so your character data handler (not your start tag handler) will be called three times for the element below: once for "really ", once for "&" and once for "  bad parser":

<data>really &amp;  bad parser</data>

I guess this is a bug, although expat is in fact allowed to deliver character data in arbitrary chunks. You can work around it by buffering the chunks in your content handler and only using the text once the end tag fires.
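
A minimal sketch of such a handler (BufferingHandler is just an illustrative name; it works with the XML class above):

class BufferingHandler
{
    var $tag;
    var $buffer = '';

    function start ($parser, $tagName, $attributes = null)
    {
        $this->tag    = $tagName;
        $this->buffer = '';
    }

    function content ($parser, $content)
    {
        # Called once per chunk, so just accumulate
        $this->buffer .= $content;
    }

    function end ($parser, $tagName)
    {
        # The element is complete; the buffer now holds the full text
        echo "$tagName: $this->buffer\n";
    }
}

XML::parse("<data>really &amp;  bad parser</data>", new BufferingHandler());
# Prints: DATA: really &  bad parser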

Parsing feeds with Ruby and the FeedTools gem

Tagged feedtools, rss, atom, parser, ruby, content encoding, utf-8, iso-8859-1  Languages ruby

This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and so on.

The only negative thing about FeedTools is that the project is abandoned; the author said this in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”

Installing

$ sudo gem install feedtools

Fetching and parsing a feed

Easy...

require 'rubygems'
require 'feed_tools'
feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

puts feed.title
puts feed.link
puts feed.description

for item in feed.items
  puts item.title
  puts item.link
  puts item.content
end

Feed autodiscovery

FeedTools finds the Slashdot feed for you.

puts FeedTools::Feed.open('http://www.slashdot.org').href

Helpers

FeedTools can also clean up your dirty XML/HTML:

require 'feed_tools'
require 'feed_tools/helpers/feed_tools_helper'

FeedTools::HtmlHelper.tidy_html(html)
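
For example, something like this should return tidied, well-formed markup (assuming the underlying tidy library that FeedTools uses is installed):

dirty = "<p>an <b>unclosed paragraph"
puts FeedTools::HtmlHelper.tidy_html(dirty)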

Database cache

FeedTools can also store the fetched feeds for you:

FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"

The schema contains all you need:

-- Example MySQL schema
  CREATE TABLE cached_feeds (
    id              int(10) unsigned NOT NULL auto_increment,
    href            varchar(255) default NULL,
    title           varchar(255) default NULL,
    link            varchar(255) default NULL,
    feed_data       longtext default NULL,
    feed_data_type  varchar(20) default NULL,
    http_headers    text default NULL,
    last_retrieved  datetime default NULL,
    time_to_live    int(10) unsigned NULL,
    serialized      longtext default NULL,
    PRIMARY KEY (id)
  );

There's even a Rails migration file included.
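
If you'd rather manage the table from your own app, a migration along these lines matches the schema above (a sketch only; the class name is an assumption and the bundled file may differ):

class CreateCachedFeeds < ActiveRecord::Migration
  def self.up
    create_table :cached_feeds do |t|
      t.string   :href
      t.string   :title
      t.string   :link
      t.text     :feed_data
      t.string   :feed_data_type, :limit => 20
      t.text     :http_headers
      t.datetime :last_retrieved
      t.integer  :time_to_live
      t.text     :serialized
    end
  end

  def self.down
    drop_table :cached_feeds
  end
end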

Feed updater

There's also a feed updater tool that can fetch feeds in the background, but I haven't had time to look at it yet.

sudo gem install feedupdater

Character set/encoding bug

As always, there are bugs you need to be aware of, and FeedTools is no different. There's an encoding bug: FeedTools encodes everything to ISO-8859-1 instead of UTF-8, which should be the default encoding.

To fix it use the following code:

require 'iconv'

# Iconv.new takes (to, from), so this converts the ISO-8859-1 strings
# that FeedTools produces back to UTF-8
ic = Iconv.new('UTF-8', 'ISO-8859-1')
feed.description = ic.iconv(feed.description)

You can also try this patch.

cd /usr/local/lib/ruby/gems/1.8/gems/
wget http://n0life.org/~julbouln/feedtools_encoding.patch
patch -p1 < feedtools_encoding.patch

The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial

Time estimation

By default FeedTools will try to estimate when a feed item was published if the date isn't available from the feed. This annoys me and creates weird publish dates, so it's usually a good idea to disable it with the timestamp_estimation_enabled option:

FeedTools.reset_configurations
FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = nil
FeedTools.configurations[:default_ttl] = 15.minutes
FeedTools.configurations[:timestamp_estimation_enabled] = false

Configuration options

To see a list of available configuration options run the following code:

require 'pp'
pp FeedTools.configurations

How to parse an RSS or Atom feed with Python and the Universal Feed Parser library

Tagged universal, feed, parser, atom, rss, python  Languages python

This example uses the Universal Feed Parser, one of the best and fastest parsers for Python.

Feed Parser is a lot faster than feed_tools for Ruby and it's about as fast as the ROME Java library according to my simple benchmark.

Feed Parser uses less memory and about as much CPU as ROME, but this wasn't tested with a long-running process, so don't take my word for it.

import time
import feedparser

start = time.time()

feeds = [
    'http://..', 
    'http://'
]

for url in feeds:
  # etag and modified enable conditional GET: the server answers with
  # 304 and no body when the feed hasn't changed since the last fetch.
  options = {
    'agent'   : '..',
    'etag'    : '..',
    # _parse_date is a private helper, but it's a handy way to build
    # the time tuple that the modified option expects
    'modified': feedparser._parse_date('Sat, 29 Oct 1994 19:43:31 GMT'),
    'referrer': '..'
  }

  feed = feedparser.parse(url, **options)

  print len(feed.entries)
  print feed.feed.title.encode('utf-8')

end = time.time()

print 'fetch took %0.3f s' % (end-start)
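
Worth knowing: feedparser doesn't raise on malformed feeds, it sets a bozo flag on the result instead. A minimal sketch of checking it, and of reusing the etag and modified values from an earlier fetch (both are only present for HTTP responses):

feed = feedparser.parse('http://..')

# feedparser swallows parse errors and records them instead
if feed.bozo:
    print 'feed is malformed:', feed.bozo_exception

# Reuse the validators from the first fetch; status 304 means unchanged
update = feedparser.parse('http://..',
                          etag=getattr(feed, 'etag', None),
                          modified=getattr(feed, 'modified', None))
if getattr(update, 'status', None) == 304:
    print 'feed has not changed'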

Perl script that can be used to calculate min, max, mean, mode, median and standard deviation for a set of log records

Tagged csv, perl, min, max, mean, log, parser  Languages perl

The best thing about this script is that it's easy to customize; right now it expects semicolon-delimited data.

use strict;
use warnings;

# Import stdev, average, mean and other statistical functions
# A copy of http://search.cpan.org/~brianl/Statistics-Lite-3.2/Lite.pm
do('stats.pl');

my %page_runtimes;
my $delimiter = ';';
my @columns = ("page", "samples", "min", "max", "mean", "mode", "median", "stddev\n");
my $line;
my ($first_timestamp, $last_timestamp);

# ==========================================
# Parse log file
# ==========================================

#
# Don't use foreach as it reads the whole file into memory: foreach $line (<>) { 
#
while ($line = <>) {
  # Remove the newline from $line, otherwise the report will be corrupted.
  chomp($line);

  my @fields       = split($delimiter, $line);
  my $timestamp    = $fields[0];
  my $page_name    = $fields[1];
  my $page_runtime = $fields[2];

  if (!defined($first_timestamp))
  {
    $first_timestamp = $timestamp;
  }

  # Print each page the first time we see it
  if (!exists $page_runtimes{$page_name})
  {
    print "Found page '$page_name'\n";
  }

  # Add the page's runtime to one hash of arrays
  push(@{$page_runtimes{$page_name}}, $page_runtime);

  $last_timestamp = $timestamp;
}

# ==========================================
# Calculate and print page statistics
# ==========================================
open(PAGE_REPORT, ">report.csv") or die("Could not open report.csv.");

print PAGE_REPORT "First sample\n".$first_timestamp."\nLast sample\n".$last_timestamp."\n\n";
print PAGE_REPORT join($delimiter, @columns);

for my $page_name (keys %page_runtimes )
{
  my @runtimes = @{$page_runtimes{$page_name}};
 
  my $samples = @runtimes;
  my $min     = min(@runtimes);
  my $max     = max(@runtimes);
  my $mean    = mean(@runtimes);
  my $mode    = mode(@runtimes);
  my $median  = median(@runtimes);
  my $stddev  = stddev(@runtimes);
 
  my @data = ($page_name, $samples, $min, $max, $mean, $mode, $median, $stddev);
 
  my $line = join($delimiter, @data);
 
  # Use comma as the decimal separator, e.g. for spreadsheets in European locales
  $line =~ s/\./,/g;
 
  print PAGE_REPORT "$line\n";
}
close(PAGE_REPORT);

To use it simply pipe some data into it like this:

grep "2008-31-12" silly-data.log | perl analyze.pl

How to parse an XML document in Go

Tagged go, xml, encoding, golang, rss, parser  Languages go

This example shows how to fetch and parse an XML feed with Go.

Save this in main_test.go:

package main

import (
    "bytes"
    "code.google.com/p/go-charset/charset"
    _ "code.google.com/p/go-charset/data" // Import charset configuration files
    "encoding/xml"
    "io/ioutil"
    "log"
    "net/http"
    "testing"
)

type RssFeed struct {
    XMLName xml.Name  `xml:"rss"`
    Items   []RssItem `xml:"channel>item"`
}

type RssItem struct {
    XMLName     xml.Name `xml:"item"`
    Title       string   `xml:"title"`
    Link        string   `xml:"link"`
    Description string   `xml:"description"`
    // Nested elements can be mapped with a path, e.g.:
    // NestedTag string `xml:"outer>inner"`
}

func fetchURL(url string) []byte {
    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("unable to GET '%s': %s", url, err)
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalf("unable to read body '%s': %s", url, err)
    }
    return body
}

func parseXML(xmlDoc []byte, target interface{}) {
    reader := bytes.NewReader(xmlDoc)
    decoder := xml.NewDecoder(reader)
    // Fixes "xml: encoding \"windows-1252\" declared but Decoder.CharsetReader is nil"
    decoder.CharsetReader = charset.NewReader
    if err := decoder.Decode(target); err != nil {
        log.Fatalf("unable to parse XML '%s':\n%s", err, xmlDoc)
    }
}

func TestParseReport(t *testing.T) {
    rssFeed := &RssFeed{}
    xmlDoc := fetchURL("https://news.ycombinator.com/rss")
    parseXML(xmlDoc, rssFeed) // rssFeed is already a pointer, don't take its address again
    for _, item := range rssFeed.Items {
        log.Printf("%s: %s", item.Title, item.Link)
    }
}

Run the code with go test.