Google snippets

Check which of your pages are in the Google supplemental index, a.k.a. Google hell

Tagged seo, google, robots.txt, noindex, follow  Languages 

Try this query to find pages that are in Google's supplemental index (a trick invented by Bruce Clay, Inc.):

The query should list all of your pages that are in Google's supplemental index, a.k.a. Google hell. These pages lower your Google PageRank, so you should tell Google not to bother indexing them. This can be done with robots.txt:

Disallow: /tags/*
Disallow: /archive/*

Another way of doing it is to add a meta tag:

<meta name="robots" content="noindex,follow"/>

This tells search engines to read the page but not index it.
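In a Rails app you can emit this meta tag conditionally from the layout. The helper below is a minimal sketch; the helper name and the path prefixes are made up for illustration:

```ruby
# Hypothetical Rails-style helper: emit a noindex meta tag for paths
# we don't want indexed (the path prefixes here are just examples).
def robots_meta_tag(path)
  if path.start_with?('/tags/') || path.start_with?('/archive/')
    '<meta name="robots" content="noindex,follow"/>'
  else
    ''
  end
end
```

Call it from the layout with the request path, for example `<%= robots_meta_tag(request.path) %>`.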

Also, be careful with what you put in robots.txt: a bad rule can block your entire site from being indexed.

How to submit your sitemap to multiple search engines

Tagged seo, sitemap, google, search  Languages 

To submit your sitemap to search engines (at least Google, MSN, and Yahoo support this feature), add this line to your robots.txt file:

Sitemap: http://www.example.com/sitemap.xml

This allows the search engine to find your sitemap when it visits your site, which means you don't have to manually register it with each search engine.

Scraping Google search results with Scrubyt and Ruby

Tagged web, scraping, google, scrubyt, ruby, gotcha, hpricot, todelete, obsolete  Languages ruby

Note that these instructions don't work with the latest Scrubyt version...

Scrubyt is a Ruby library that allows you to easily scrape the contents of any site.

First install Scrubyt:

$ sudo gem install mechanize hpricot parsetree ruby2ruby scrubyt

You also need to install RubyInline version 3.6.3:

$ sudo gem install -v 3.6.3 RubyInline

If you install the wrong RubyInline version or have multiple versions installed, you'll get the following error:

/usr/lib/ruby/1.8/rubygems.rb:207:in `activate': can't activate RubyInline (= 3.6.3), already activated RubyInline-3.6.6 (Gem::Exception)
       from /usr/lib/ruby/1.8/rubygems.rb:225:in `activate'
       from /usr/lib/ruby/1.8/rubygems.rb:224:in `each'
       from /usr/lib/ruby/1.8/rubygems.rb:224:in `activate'
       from /usr/lib/ruby/1.8/rubygems/custom_require.rb:32:in `require'
       from t2:2

To fix it, uninstall the newer version and keep only version 3.6.3:

$ sudo gem uninstall RubyInline

Select RubyGem to uninstall:
 1. RubyInline-3.6.3
 2. RubyInline-3.6.6
 3. All versions
> 2

Scraping Google search results

Then run this script to scrape the first two pages of Google results for "ruby":

require 'rubygems'
require 'scrubyt'

# Create a learning extractor
data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/'
  fill_textfield 'q', 'ruby'
  submit
  # Teach Scrubyt what we want to retrieve.
  # In this case we want Scrubyt to find all search results,
  # and "Ruby Programming Language" happens to be the first
  # link in the result list. Change "Ruby Programming Language"
  # to whatever you want Scrubyt to find.
  link do
    name  "Ruby Programming Language"
    url   "href", :type => :attribute
  end
  # Click next until we're on the second page.
  next_page "Next", :limit => 2
end

# Print out what Scrubyt found
puts data.to_xml

puts "Your production scraper has been created: data_extractor_export.rb."

# Export the production version of the scraper
data.export(__FILE__)

Learning Extractor vs Production extractor

Note that this example uses the Learning Extractor functionality of Scrubyt.

The production extractor is generated by the last line of the script:

data.export(__FILE__)

If you open the production extractor in an editor you'll see that it uses XPath queries to extract the content:

link("/html/body/div/div/div/h2", { :generalize => true }) do
  url("href", { :type => :attribute })
end
Finding the correct XPath

The learning mode is pretty good at finding the XPath of HTML elements, but if you have difficulties getting Scrubyt to extract exactly what you want, simply install Firebug and use the Inspect feature to select the item you want to extract the value from. Then right-click on it in the Firebug window and choose Copy XPath.

Note that there's a gotcha when copying the XPath of an element with Firebug. Firebug uses Firefox's internal, normalized DOM model, which might not match the real-world HTML structure. For example, the tbody tag is usually added by Firefox/Firebug, and it should be removed from the XPath if it isn't in the actual HTML.
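If you copy XPaths from Firebug often, stripping the injected tbody steps can be automated. This one-liner is a sketch (the helper name is made up):

```ruby
# Remove the tbody steps that Firefox inserts into the DOM
# but that may not exist in the real HTML.
def clean_xpath(xpath)
  xpath.gsub('/tbody', '')
end

clean_xpath('/html/body/table/tbody/tr/td[2]') # => "/html/body/table/tr/td[2]"
```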

Another option that I haven't tried myself is to use the XPather extension.

Using hpricot to find the XPath

If you're really having problems finding the right XPath of an element, you can also use Hpricot to find it. The code in this example prints out the XPath of all table cells containing the text 51,992:

require 'rexml/document'
require 'hpricot'
require 'open-uri'

url = "http://xyz"

page = Hpricot(open(url,
  'User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv: Gecko/20080201 Firefox/',
  'Referer'    => 'http://xyz'
))

(page/"//td:contains('51,992')").each do |row|
  puts row.xpath
end

The snippet prints the XPath of each matching element.

Note that sometimes I find that Hpricot is easier to use than Scrubyt, so use whatever works best for you.

Miscellaneous problems

The following problem can be solved by following the instructions found here:

Your production scraper has been created: data_extractor_export.rb.
/var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `extend': wrong argument type Class (expected Module) (TypeError)
       from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `to_sexp'
       from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:93:in `parse_tree_for_method'
       from /var/lib/gems/1.8/gems/ruby2ruby-1.1.6/lib/ruby2ruby.rb:1063:in `to_sexp'

How to exclude your own traffic from Google Analytics reports and other JavaScript based analytics software

Tagged google, analytics, tracking, exclude, traffic  Languages ruby

Option 1: Changing your browser's user agent

Open the about:config page in Firefox by typing about:config in the address bar and pressing Enter. Then change the general.useragent.extra.firefox setting to an easily identifiable string, for example:

Firefox/3.0 disable-tracking

Then, in your code, check that the user agent string doesn't contain disable-tracking:

<% if !request.user_agent.include?('disable-tracking') %>
  <%# Analytics tracking code goes here %>
<% end %>

Option 2: Use Google Analytics filters

Use one of Google Analytics' native ways of excluding traffic from certain domains, IPs, user agents, or users having a specific browser cookie.

How to track user actions and custom events with Google Analytics and jQuery

Tagged jquery, google, analytics, track, click  Languages javascript

This is a customization of Rebecca Murphey's script:

$('a').each(function() {
    var $a = $(this);
    var href = $a.attr('href');
    if (typeof pageTracker == 'undefined') { return; }

    // Link is external
    if (href.match(/^http/) && !href.match(document.domain)) {
        $a.click(function() {
            pageTracker._trackPageview('/external/' + href);
        });
    } else {
        $a.click(function() {
            pageTracker._trackPageview('/internal' + href);
        });
    }
});
Note that clicks are shown as page views in reports, so you should exclude them from all reports. A future version of Google Analytics will allow you to track events, such as mouse clicks, without affecting page view reporting, see this page on the new event tracking beta feature for more information.

How to optimize your MephistoBlog powered site's search engine ranking (SEO for MephistoBlog)

Tagged seo, mephistoblog, meta, google, search, keywords  Languages 

At Aktagon we use MephistoBlog as our CMS, and I couldn't find any information on how to SEO-optimize a MephistoBlog site for Google, so I'm sharing my notes here.

This tip shows you how to make your pages more search engine friendly.

First, add the title tag, plus the meta description and keywords tags, to your layout's Liquid template, as shown here:

<meta name="description" content="{% if article %} {{ article.excerpt }} {% else %} YOUR DEFAULT SITE DESCRIPTION {% endif %}" />
<meta name="keywords" content="{% if article %} {% for tag in article.tags %}{{ tag }}, {% endfor %} {% endif %} YOUR DEFAULT KEYWORDS" />
<title>{% if article %} {{ article.title }} &raquo; {{ site.title }} {% else %} {{ site.title }} &raquo; {{ site.subtitle }} {% endif %}</title>

Remember to update the default description and keywords in the meta tags' body.

Now, whenever you publish an article, simply add an excerpt and some tags to it. The excerpt is used as the meta description and the article's tags as the meta keywords. Both make Google a bit happier, but the description is by far the more important of the two.

How to automatically ping search engines when your sitemap has changed

Tagged sitemap, ruby, ping, search, google  Languages ruby

I prefer letting cron update sitemaps in the background, and at the end of the script I ping search engines to let them know it's been updated:

require 'open-uri'

# Recreating the sitemap goes here

# Let search engines know about the update
[ "",
  "" ].each do |url|
  open(url) do |f|
    if f.status[0] == "200"
      puts "Sitemap successfully submitted to #{url}"
    else
      puts "Failed to submit sitemap to #{url}"
    end
  end
end


Tracking 404 and 500 with Google Analytics

Tagged 404, 500, google, analytics, track  Languages javascript

New Google Analytics

_gaq.push(['_trackEvent', 'HTTP status', '404', '/xxx/what-a-fish']);

Old Google Analytics

Tracking 404 and 500 errors with Google Analytics is documented here, but I tend to forget so I'm putting the information here:

// 404
pageTracker._trackPageview("/404.html?page=" + document.location.pathname + "&from=" + document.referrer);

// 500
pageTracker._trackPageview("/500.html?page=" + document.location.pathname + "&from=" + document.referrer);


In Rails use the response status code to track any HTTP errors:

<% if response.status != 200 %>
_gaq.push(['_trackEvent', 'HTTP status', '<%= response.status %>', '<%= request.fullpath %>']);
<% end %>

Google Maps Version 3 Example with Markers and InfoWindow

Tagged google-maps, maps, google  Languages html
<style media="screen" type="text/css">
  #map { width:960px; height:330px; }
</style>

<script type="text/javascript" src="http://maps.google.com/maps/api/js?sensor=false"></script>

<div id="map"></div>

<script type="text/javascript">
  var map;
  var initialized = false;

  var infowindow = new google.maps.InfoWindow({
    content: ''
    //disableAutoPan: true // Not compatible with InfoWindows. They are cropped...
  });

  // Triggered when the map is loaded or moved
  var boundsChangedListener = function() {
    if (initialized == true) { return; }
    initialized = true;
    addMarkers();
  };

  function addMarkers() {
    var bounds = map.getBounds();
    var southWest = bounds.getSouthWest();
    var northEast = bounds.getNorthEast();

    var lngSpan = northEast.lng() - southWest.lng();
    var latSpan = northEast.lat() - southWest.lat();

    var icon = '/images/icons/xxx-club-16.gif';

    // Add ten markers at random positions within the current viewport
    for (var i = 0; i < 10; i++) {
      var point = new google.maps.LatLng(
        southWest.lat() + latSpan * Math.random(),
        southWest.lng() + lngSpan * Math.random()
      );

      var marker = new google.maps.Marker({
        position: point,
        map:      map,
        icon:     icon,
        title:    "Marker"
      });

      addMarker(marker);
    }
  }

  function addMarker(marker) {
    google.maps.event.addListener(marker, 'mouseover', function() {
      marker.html = 'Marker xxx';
      infowindow.setContent(marker.html);
      infowindow.open(map, marker);
    });

    google.maps.event.addListener(marker, 'mouseout', function() {
      infowindow.close();
    });
  }

  // Map setup; the zoom and center values are just examples
  map = new google.maps.Map(document.getElementById('map'), {
    zoom: 8,
    center: new google.maps.LatLng(0, 0),
    mapTypeId: google.maps.MapTypeId.ROADMAP
  });
  google.maps.event.addListener(map, 'bounds_changed', boundsChangedListener);
</script>
How to export a Google Doc spreadsheet to CSV

Tagged csv, json, google, spreadsheet  Languages 

It's no longer possible to export Google Spreadsheets to CSV directly, so use the JSON list feed instead: https://spreadsheets.google.com/feeds/list/<SPREADSHEET_KEY>/od6/public/values?alt=json

You can find the SPREADSHEET_KEY in the Google Doc's original URL.
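If you still need CSV, you can convert the JSON list feed yourself. The sketch below assumes the feed's rows live under feed/entry and that cell values are stored in "gsx$..." keys, as in the list feed format:

```ruby
require 'json'
require 'csv'

# Convert the spreadsheet's JSON list feed into CSV text.
# Each entry's "gsx$..." keys hold the cell values under "$t".
def feed_to_csv(json)
  feed = JSON.parse(json)
  rows = feed['feed']['entry'].map do |entry|
    entry.select { |key, _| key.start_with?('gsx$') }
         .map { |_, value| value['$t'] }
  end
  rows.map(&:to_csv).join
end
```

Fetch the feed with open-uri and pass the body to this method to get CSV back.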

You can also use the Google Drive API.