www snippets

How to improve your PageRank with 301 permanent redirects when using Nginx

Tagged seo, www, 301, permanent, nginx, redirect, rewrite, module  Languages 

Matthew Inman of seomoz.org fame wrote about how Digg could increase its revenue by using a so-called canonical URL for its whole site. This can be implemented by redirecting users who type in, for example, www.digg.com to digg.com. The reasoning is that instead of having backlinks split across two hostnames (www and no-www), all backlinks should point to just one, which can improve your search engine ranking.

# Redirect any host that starts with "www." to the bare domain
if ($host ~* "^www\.") {
      rewrite ^(.*)$ http://aktagon.com$1 permanent;
}

Permanent redirects are also a good idea if you move your content to a new domain, for example from digg.com to dugg.com.

The syntax for the Nginx rewrite module (ngx_http_rewrite_module) is documented in the official Nginx documentation.
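
To make the behavior of the redirect rule above concrete, here is a small Ruby model of the decision it makes. This is only a sketch, not Nginx code: the helper name canonical_redirect and the constant CANONICAL_HOST are made up for illustration, and the sketch assumes the intent is to match only hosts that begin with "www." (an unanchored pattern such as "www" would also match a host like awww.example.com).

```ruby
# Hypothetical model of the redirect decision; not part of Nginx itself.
CANONICAL_HOST = 'aktagon.com'

# Returns [status, location] for hosts that start with "www." (case-insensitive),
# or nil when the request should be served without a redirect.
def canonical_redirect(host, request_uri)
  return nil unless host =~ /\Awww\./i

  # 301 marks the redirect as permanent; the path and query string are kept.
  [301, "http://#{CANONICAL_HOST}#{request_uri}"]
end

p canonical_redirect('www.aktagon.com', '/snippets?page=2')
p canonical_redirect('aktagon.com', '/snippets')
```

The first call yields a 301 pointing at the bare domain with the same path and query string; the second returns nil because the host is already canonical.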

Using the WWW::Mechanize RubyGem to scrape login protected pages

Tagged www, mechanize, scraping, scrape, login, ruby  Languages ruby

This is an example of how to access a login-protected site with WWW::Mechanize. In this example, the login form has two fields named user and password. In other words, the HTML contains the following code:

<input name="user" .../>
<input name="password" .../>

Note that this example also shows how to enable WWW::Mechanize logging and how to capture the HTML response:

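require 'rubygems'
require 'logger'
require 'mechanize'

# Log requests and responses to STDERR.
# Note: since Mechanize 1.0 the class is Mechanize rather than WWW::Mechanize.
agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
# agent.set_proxy('a-proxy', 8080)
page = agent.get 'http://bobthebuilder.com'

# Fill in the first form on the page; the accessors match the input names.
form = page.forms.first
form.user = 'bob'
form.password = 'password'

page = agent.submit form

# Save the HTML response to a file for inspection.
File.open("output.html", "w") { |file| file << page.body }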
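
For comparison, the form submission can be sketched with Ruby's standard library alone. This assumes the login form uses method POST with the default URL encoding, and the /login path is hypothetical, since the form's action URL is not shown here. The request is only built, not sent.

```ruby
require 'uri'
require 'net/http'

# Hypothetical login endpoint; the real form's action URL is an assumption.
LOGIN_URL = URI('http://bobthebuilder.com/login')

# Encode the two form fields the way a browser would for a default POST form.
def login_body(user, password)
  URI.encode_www_form('user' => user, 'password' => password)
end

# Build (but do not send) the POST request carrying the encoded fields.
def login_request(user, password)
  request = Net::HTTP::Post.new(LOGIN_URL)
  request['Content-Type'] = 'application/x-www-form-urlencoded'
  request.body = login_body(user, password)
  request
end

puts login_body('bob', 'password')
```

Unlike this sketch, the Mechanize agent also keeps the session cookies between requests, which is what makes it convenient for pages behind a login.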

Use the search method to scrape the page content. In this example I extract all text contained by span elements, which in turn are contained by a table element having a class attribute equal to 'list-of-links':

puts page.search("//table[@class='list-of-links']//span/text()")
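
The same XPath expression can be tried without a network connection. The sketch below swaps in REXML (shipped with standard Ruby installations) in place of the parser Mechanize uses, and runs against a made-up, well-formed HTML fragment, so treat it as an illustration of the expression rather than of Mechanize itself.

```ruby
require 'rexml/document'

# Made-up sample markup with the tr and td elements included.
html = <<~HTML
  <table class="list-of-links">
    <tr><td><span>First link</span></td></tr>
    <tr><td><span>Second link</span></td></tr>
  </table>
HTML

doc = REXML::Document.new(html)

# Select the text nodes of every span nested anywhere inside the matching table.
xpath = "//table[@class='list-of-links']//span/text()"
texts = REXML::XPath.match(doc, xpath).map(&:value)

puts texts
```

REXML only accepts well-formed XML, so this works for tidy markup like the sample; real-world HTML usually needs a more forgiving parser.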

The HTML looks like this (td, tr elements omitted for clarity):

<table class="list-of-links">
<span>The content</span>