Using the WWW::Mechanize RubyGem to scrape login-protected pages
This example shows how to access a login-protected site with WWW::Mechanize. The login form has two fields, named user and password, so the HTML contains the following:
    <input name="user" .../>
    <input name="password" .../>
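Note that this example also shows how to enable WWW::Mechanize logging and how to save the HTML response to a file:
    require 'rubygems'
    require 'logger'
    require 'mechanize'

    # Log requests and responses to STDERR.
    # Note: Mechanize 1.0 and later dropped the WWW namespace (use Mechanize.new).
    agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
    # Uncomment to send requests through a proxy:
    #agent.set_proxy('a-proxy', '8080')
    page = agent.get 'http://bobthebuilder.com'

    # Fill in the first form on the page; the accessors match the input names.
    form = page.forms.first
    form.user = 'bob'
    form.password = 'password'

    # Submit the form; the returned page is the response to the login.
    page = agent.submit form

    # Save the raw HTML of the response for inspection.
    File.open("output.html", "w") { |file| file << page.body }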
Use the search method to scrape the page content. In this example I extract the text of all span elements inside a table element whose class attribute is ‘list-of-links’:
    puts page.search("//table[@class='list-of-links']//span/text()")
The HTML looks like this (td, tr elements omitted for clarity):
    ...
    <table class="list-of-links">
      ...
      <span>The content</span>
      ...
    </table>
    ...
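To try the XPath expression without a live site, here is a minimal offline sketch. It swaps in REXML, which ships with Ruby, in place of the parser behind Mechanize's search method, so it only works on well-formed markup. The sample markup, the constant names, and the span_texts helper are assumptions made up for illustration.

```ruby
require 'rexml/document'

# Sample markup modeled on the snippet above. The tr/td elements and the
# second table are assumptions added to give the XPath something to skip.
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <table class="list-of-links">
        <tr><td><span>The content</span></td></tr>
        <tr><td><span>More content</span></td></tr>
      </table>
      <table class="other">
        <tr><td><span>Ignored</span></td></tr>
      </table>
    </body>
  </html>
HTML

# The same XPath expression that is passed to page.search in the post.
LIST_XPATH = "//table[@class='list-of-links']//span/text()"

# Hypothetical helper: returns the text of every matching node as a String.
def span_texts(markup, xpath = LIST_XPATH)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map(&:value)
end

puts span_texts(SAMPLE_HTML)
```

Only spans inside the table with the matching class attribute are returned. Real-world HTML is often not well-formed XML, which is one reason to prefer Mechanize's own search method on actual pages.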
How to improve your PageRank with 301 permanent redirects when using Nginx
Matthew Inman of seomoz.org fame wrote about how Digg could increase their revenue by using a so-called canonical URL for their whole site. This can be implemented by redirecting users who type in, for example, www.digg.com to digg.com. The reasoning is that instead of having backlinks split between two hostnames (www and no-www), all backlinks point to just one, which can improve your search engine ranking. With Nginx, a rule like the following does this:
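    # Redirect hosts that start with "www." to the bare domain with a 301.
    # The pattern is anchored so it does not match "www" elsewhere in the host.
    if ($host ~* ^www\.) {
      rewrite ^(.*)$ http://aktagon.com$1 permanent;
      break;
    }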
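To make the intent of the rule concrete, here is a rough Ruby model of the mapping it is meant to perform: a request to a host beginning with www. is answered with a permanent redirect to the same path on the bare domain. This is only a sketch of the logic, not how Nginx evaluates the rule, and the canonical_redirect helper and its parameters are made up for illustration.

```ruby
# Hypothetical helper: returns the URL a request should be 301-redirected to,
# or nil when the host is already the canonical (non-www) one.
def canonical_redirect(host, request_uri, canonical_host: 'aktagon.com')
  # Case-insensitive match on a leading "www.", like ~* in the Nginx rule.
  return nil unless host.match?(/\Awww\./i)

  # Keep the path (and any query string) and swap in the canonical host.
  "http://#{canonical_host}#{request_uri}"
end

p canonical_redirect('www.aktagon.com', '/posts/1?page=2')
p canonical_redirect('aktagon.com', '/posts/1')
```

The first call returns the redirect target on the bare domain, and the second returns nil because no redirect is needed.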
Permanent redirects are also a good idea if you move your content to a new domain, for example from digg.com to dugg.com.
The syntax for the rewrite module is described in the Nginx documentation for ngx_http_rewrite_module.