Perl script that can be used to calculate min, max, mean, mode, median and standard deviation for a set of log records
The best thing about this script is that it’s easy to customize. Right now it expects semicolon-delimited records with the timestamp, page name and runtime in the first three columns.
    use strict;
    use warnings;

    # Import stddev, mean, median and other statistical functions.
    # stats.pl is a copy of http://search.cpan.org/~brianl/Statistics-Lite-3.2/Lite.pm
    do('stats.pl');

    my %page_runtimes;
    my $delimiter = ';';
    my @header = ("page", "samples", "min", "max", "mean", "mode", "median", "stddev");
    my ($first_timestamp, $last_timestamp);

    # ==========================================
    # Parse log file
    # ==========================================
    while (my $line = <>) {
        # Remove the newline from $line, otherwise the report will be corrupted.
        chomp($line);

        my ($timestamp, $page_name, $page_runtime) = split(/\Q$delimiter\E/, $line);

        $first_timestamp = $timestamp if !defined($first_timestamp);

        # Print each page the first time we see it.
        # defined(@array) is an error in modern Perl, so test the hash key instead.
        if (!exists $page_runtimes{$page_name}) {
            print "Found page '$page_name'\n";
        }

        # Collect the runtimes per page.
        push(@{$page_runtimes{$page_name}}, $page_runtime);

        $last_timestamp = $timestamp;
    }

    # ==========================================
    # Calculate and print page statistics
    # ==========================================
    open(my $report, '>', 'report.csv') or die("Could not open report.csv: $!");

    print $report "First sample\n$first_timestamp\nLast sample\n$last_timestamp\n\n";
    print $report join($delimiter, @header), "\n";

    for my $page_name (keys %page_runtimes) {
        my @runtimes = @{$page_runtimes{$page_name}};

        my @stats = (scalar(@runtimes), min(@runtimes), max(@runtimes), mean(@runtimes),
                     mode(@runtimes), median(@runtimes), stddev(@runtimes));

        # Use a decimal comma in the numbers only, so dots in page names survive.
        s/\./,/ for @stats;

        print $report join($delimiter, $page_name, @stats), "\n";
    }

    close($report);
To use it simply pipe some data into it like this:
    grep "2008-31-12" silly-data.log | perl analyze.pl
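For comparison, here is a minimal Python sketch of the same aggregation, using only the standard library `statistics` module. It assumes the `timestamp;page;runtime` record layout described above; the function name `page_statistics` and the sample records are made up for illustration and do not come from the original script.

```python
import statistics
from collections import defaultdict


def page_statistics(lines, delimiter=";"):
    """Group runtimes by page and compute summary statistics per page."""
    runtimes = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed record layout: timestamp;page;runtime
        _timestamp, page, runtime = line.split(delimiter)[:3]
        runtimes[page].append(float(runtime))

    report = {}
    for page, values in runtimes.items():
        report[page] = {
            "samples": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "mode": statistics.mode(values),
            "median": statistics.median(values),
            # The sample standard deviation needs at least two values.
            "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return report


# Hypothetical sample records, only here to exercise the function.
sample = [
    "2008-12-31 10:00:00;index;100",
    "2008-12-31 10:00:01;index;200",
    "2008-12-31 10:00:02;index;200",
    "2008-12-31 10:00:03;search;50",
]
report = page_statistics(sample)
print(report["index"])
```

Grouping into a dictionary of lists mirrors the `%page_runtimes` hash in the Perl version, so the two are easy to compare side by side.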
How to parse an RSS or Atom feed with Python and the Universal Feed Parser library
This example uses the Universal Feed Parser, one of the best and fastest parsers for Python.
Feed Parser is a lot faster than feed_tools for Ruby and it’s about as fast as the ROME Java library according to my simple benchmark.
Feed Parser uses less memory and about as much CPU as ROME, but I didn’t test this with a long-running process, so don’t take my word for it.
    import time
    import feedparser

    start = time.time()

    feeds = [
        'http://..',
        'http://'
    ]

    # Parse the HTTP date with the standard library instead of the private
    # feedparser._parse_date helper; `modified` also accepts a time tuple.
    modified = time.strptime('Sat, 29 Oct 1994 19:43:31 GMT', '%a, %d %b %Y %H:%M:%S GMT')

    for url in feeds:
        options = {
            'agent': '..',
            'etag': '..',
            'modified': modified,
            'referrer': '..'
        }

        feed = feedparser.parse(url, **options)

        print(len(feed.entries))
        # The title may be missing if the fetch or parse failed.
        print(feed.feed.get('title', ''))

    end = time.time()

    print('fetch took %0.3f s' % (end - start))
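The `etag` and `modified` options are what make repeated fetches cheap: you send back the validators from the previous response. Below is a small sketch of how those keyword arguments could be built from an earlier result. The helper name `conditional_kwargs` is hypothetical, and the `previous` dictionary is a made-up stand-in for a real `feedparser.parse()` result, which exposes `etag` and `modified` when the server sent them.

```python
def conditional_kwargs(previous):
    """Build etag/modified keyword arguments from an earlier parse result.

    `previous` is anything with a dict-style .get(), such as the object
    returned by feedparser.parse(). Missing validators are simply left out.
    """
    kwargs = {}
    for key in ("etag", "modified"):
        value = previous.get(key)
        if value:
            kwargs[key] = value
    return kwargs


# Stand-in for a previous result; a real one would come from feedparser.parse().
previous = {
    "etag": '"abc123"',
    "modified": "Sat, 29 Oct 1994 19:43:31 GMT",
    "status": 200,
}
print(conditional_kwargs(previous))
```

With a real result you would call `feedparser.parse(url, **conditional_kwargs(previous))`. A server that honors the validators then answers with status 304 and an empty entry list, so there is nothing to parse again.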
Parsing feeds with Ruby and the FeedTools gem
This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and so on…
The only negative thing about FeedTools is that the project is abandoned. The author said this in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”
Installing
    $ sudo gem install feedtools
Fetching and parsing a feed
Easy…
    require 'rubygems'
    require 'feed_tools'

    feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

    puts feed.title
    puts feed.link
    puts feed.description

    feed.items.each do |item|
      puts item.title
      puts item.link
      puts item.content
    end
Feed autodiscovery
FeedTools finds the Slashdot feed for you.
    puts FeedTools::Feed.open('http://www.slashdot.org').href
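Autodiscovery of this kind usually works by looking for `<link rel="alternate">` tags with a feed MIME type in the page's HTML. The Python sketch below shows that general technique with the standard library. It is not FeedTools' actual implementation, and the class name, function name and sample page are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point at feeds."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        type_ = (attrs.get("type") or "").lower()
        href = attrs.get("href")
        if "alternate" in rel and type_ in FEED_TYPES and href:
            # Resolve relative links against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


# Hypothetical page used only to exercise the finder.
page = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="/index.rss">
</head><body></body></html>"""
print(discover_feeds(page, "http://www.example.org/"))
```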
Helpers
FeedTools can also clean up your dirty XML/HTML:
    require 'feed_tools'
    require 'feed_tools/helpers/feed_tools_helper'

    FeedTools::HtmlHelper.tidy_html(html)
Database cache
FeedTools can also store the fetched feeds for you:
    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"
The schema contains all you need:
    -- Example MySQL schema
    CREATE TABLE `cached_feeds` (
      `id` int(10) unsigned NOT NULL auto_increment,
      `href` varchar(255) default NULL,
      `title` varchar(255) default NULL,
      `link` varchar(255) default NULL,
      `feed_data` longtext default NULL,
      `feed_data_type` varchar(20) default NULL,
      `http_headers` text default NULL,
      `last_retrieved` datetime default NULL,
      `time_to_live` int(10) unsigned NULL,
      `serialized` longtext default NULL,
      PRIMARY KEY (`id`)
    );
There’s even a Rails migration file included.
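To make the role of the `last_retrieved` and `time_to_live` columns concrete, here is a rough Python sketch of a feed cache with an expiry check, using an in-memory SQLite database. It only illustrates the idea behind such a table. The trimmed-down schema, the function names and the use of Unix timestamps are my assumptions, not how FeedTools' `DatabaseFeedCache` works internally.

```python
import sqlite3
import time

# Simplified schema (assumption): only the columns needed for the expiry check.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cached_feeds (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    href           TEXT UNIQUE,
    feed_data      TEXT,
    last_retrieved INTEGER,  -- Unix timestamp
    time_to_live   INTEGER   -- seconds
)
"""


def store(conn, href, feed_data, ttl, now=None):
    """Insert or overwrite the cached copy of a feed."""
    now = int(time.time()) if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO cached_feeds "
        "(href, feed_data, last_retrieved, time_to_live) VALUES (?, ?, ?, ?)",
        (href, feed_data, now, ttl),
    )


def fetch_fresh(conn, href, now=None):
    """Return cached feed data, or None if it is missing or expired."""
    now = int(time.time()) if now is None else now
    row = conn.execute(
        "SELECT feed_data, last_retrieved, time_to_live "
        "FROM cached_feeds WHERE href = ?",
        (href,),
    ).fetchone()
    if row is None:
        return None
    feed_data, last_retrieved, ttl = row
    return feed_data if now - last_retrieved < ttl else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store(conn, "http://www.example.org/index.rss", "<rss/>", ttl=900, now=1000)
print(fetch_fresh(conn, "http://www.example.org/index.rss", now=1500))  # within TTL
print(fetch_fresh(conn, "http://www.example.org/index.rss", now=2000))  # expired
```

Passing `now` explicitly keeps the expiry logic easy to test without waiting for real time to pass.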
Feed updater
There’s also a feed updater tool that can fetch feeds in the background, but I haven’t had time to look at it yet.
    sudo gem install feedupdater
Character set/encoding bug
As always, there are bugs you need to be aware of, and FeedTools is no different. There’s an encoding bug: FeedTools encodes everything as ISO-8859-1 instead of UTF-8, which should be the default encoding.
To fix it use the following code:
    require 'iconv'

    # Iconv.new takes (to, from), so this converter goes from UTF-8 to ISO-8859-1.
    ic = Iconv.new('ISO-8859-1', 'UTF-8')
    feed.description = ic.iconv(feed.description)
You can also try this patch.
    cd /usr/local/lib/ruby/gems/1.8/gems/
    wget http://n0life.org/~julbouln/feedtools_encoding.patch
    patch -p1 < feedtools_encoding.patch
The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial
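If you have never seen this class of bug, the Python sketch below reproduces one common form of it: UTF-8 bytes that get decoded as ISO-8859-1 somewhere along the way, which turns every non-ASCII character into two garbage characters. Whether this is exactly what happens inside FeedTools is an assumption on my part. The helper name `repair_mojibake` and the sample string are made up.

```python
def repair_mojibake(text):
    """Undo a wrong ISO-8859-1 decode of bytes that were really UTF-8."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way; leave it alone.
        return text


original = "Malmö café"
# Simulate the bug: UTF-8 bytes interpreted as ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(garbled)
print(repair_mojibake(garbled))
```

The repair is the same round trip in reverse, which is also what a UTF-8 to ISO-8859-1 conversion does at the byte level.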
Time estimation
By default, FeedTools tries to estimate when a feed item was published if the feed doesn’t provide a date. This annoys me and produces weird publish dates, so it’s usually a good idea to disable it with the timestamp_estimation_enabled option:
    FeedTools.reset_configurations
    FeedTools.configurations[:tidy_enabled] = false
    FeedTools.configurations[:feed_cache] = nil
    FeedTools.configurations[:default_ttl] = 15.minutes
    FeedTools.configurations[:timestamp_estimation_enabled] = false
Configuration options
To see a list of available configuration options run the following code:
    require 'pp'

    pp FeedTools.configurations
A simple and easy to use PHP XML parser
The PHP XML parser:
    class XML
    {
        /**
         * Parse $data and dispatch SAX-style events to $handler, which must
         * provide start(), end() and content() methods.
         */
        public static function parse($data, $handler, $encoding = "UTF-8")
        {
            $parser = xml_parser_create($encoding);

            xml_set_object($parser, $handler);

            xml_set_element_handler($parser,
                array($handler, 'start'),
                array($handler, 'end')
            );

            xml_set_character_data_handler(
                $parser,
                array($handler, 'content')
            );

            // The whole document is passed in one call, so mark it as final;
            // otherwise truncated input is not reported as an error.
            $result = xml_parse($parser, $data, true);

            if(!$result)
            {
                $error_string = xml_error_string(xml_get_error_code($parser));
                $error_line = xml_get_current_line_number($parser);
                $error_column = xml_get_current_column_number($parser);

                $message = sprintf("XML error '%s' at line %d column %d", $error_string, $error_line, $error_column);

                // Free the parser before bailing out.
                xml_parser_free($parser);

                throw new Exception($message);
            }

            xml_parser_free($parser);
        }
    }
A result handler:
    class ResultHandler
    {
        public $tag;

        function start($parser, $tagName, $attributes = null)
        {
            echo "start";
            $this->tag = $tagName;
        }

        function end($parser, $tagName)
        {
            echo "end";
            $this->tag = null;
        }

        function content($parser, $content)
        {
            echo "$this->tag: $content";
        }
    }
Then in your code:
    $xml = "<a>bah</a>";
    XML::parse($xml, new ResultHandler());
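The same start/end/content handler pattern exists in most languages. As a point of comparison, here is a rough Python equivalent built on the standard library `xml.parsers.expat` module. The `events` list is my addition so the result can be inspected instead of echoed, and the names simply mirror the PHP classes above.

```python
import xml.parsers.expat


class ResultHandler:
    """Collects (event, value) tuples instead of echoing them."""

    def __init__(self):
        self.tag = None
        self.events = []

    def start(self, name, attributes):
        self.tag = name
        self.events.append(("start", name))

    def end(self, name):
        self.tag = None
        self.events.append(("end", name))

    def content(self, data):
        self.events.append(("content", f"{self.tag}: {data}"))


def parse(data, handler):
    parser = xml.parsers.expat.ParserCreate()
    # Deliver adjacent character data in one callback instead of several chunks.
    parser.buffer_text = True
    parser.StartElementHandler = handler.start
    parser.EndElementHandler = handler.end
    parser.CharacterDataHandler = handler.content
    # Raises xml.parsers.expat.ExpatError (with lineno and offset) on bad input.
    parser.Parse(data, True)


handler = ResultHandler()
parse("<a>bah</a>", handler)
print(handler.events)
```

One difference to keep in mind: PHP's parser upper-cases tag names by default (case folding), so the PHP handler sees `A` where this sketch sees `a`.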