Using the WWW::Mechanize RubyGem to scrape login-protected pages
This is an example of how to access a login-protected site with WWW::Mechanize. In this example, the login form has two fields named user and password. In other words, the HTML contains the following code:
    <input name="user" .../>
    <input name="password" .../>
Note that this example also shows how to enable WWW::Mechanize logging and how to save the HTML response to a file:
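    require 'rubygems'
    require 'logger'
    require 'mechanize'

    # Create the agent and send its log output to STDERR
    agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
    #agent.set_proxy('a-proxy', '8080')
    page = agent.get 'http://bobthebuilder.com'

    # Fill in the first form on the page; the setters match the input names
    form = page.forms.first
    form.user = 'bob'
    form.password = 'password'

    # Submit the form; the result is the page returned after login
    page = agent.submit form

    # Save the response body so it can be inspected later
    output = File.open("output.html", "w") { |file| file << page.body }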
Use the search method to scrape the page content. In this example I extract the text of every span element nested inside a table element whose class attribute equals ‘list-of-links’:
    puts page.search("//table[@class='list-of-links']//span/text()")
The HTML looks like this (td, tr elements omitted for clarity):
    ...
    <table class="list-of-links">
      ...
      <span>The content</span>
      ...
    </table>
    ...
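The XPath expression can also be tried offline, without logging in to anything. The following is only a minimal sketch: it swaps Mechanize for REXML (shipped with Ruby) so that it runs without network access, and it assumes the markup is well-formed XML, which real pages often are not (Mechanize's own HTML parser is far more forgiving). The names SAMPLE_HTML, LINK_TEXT_XPATH and span_texts are made up for this illustration, as are the second row and the second table, which are there only to show that the class filter excludes other tables.

```ruby
require 'rexml/document'

# Hypothetical sample markup, modelled on the snippet above with the
# tr/td elements put back in. The second table is invented for the demo.
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <table class="list-of-links">
        <tr><td><span>The content</span></td></tr>
        <tr><td><span>Another link</span></td></tr>
      </table>
      <table class="other">
        <tr><td><span>Ignored</span></td></tr>
      </table>
    </body>
  </html>
HTML

# Same XPath expression as the one passed to page.search
LINK_TEXT_XPATH = "//table[@class='list-of-links']//span/text()"

# Returns the text of every node matched by the XPath, as plain strings.
# Assumption: the markup parses as XML; REXML raises an error otherwise.
def span_texts(markup, xpath = LINK_TEXT_XPATH)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map(&:value)
end

puts span_texts(SAMPLE_HTML)
```

Once the expression selects what you expect on a saved copy of the page (for instance a tidied-up output.html), the same string can be handed to page.search in the Mechanize script.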