Using the WWW::Mechanize RubyGem to scrape login-protected pages
This is an example of how to access a login-protected site with WWW::Mechanize. In this example, the login form has two fields named user and password. In other words, the HTML contains the following code:
    <input name="user" .../>
    <input name="password" .../>
Note that this example also shows how to enable WWW::Mechanize logging and how to capture the HTML response:
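    require 'rubygems'
    require 'logger'
    require 'mechanize'

    # Log requests and responses to STDERR.
    # Newer Mechanize releases drop the WWW namespace; use Mechanize.new there.
    agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
    # agent.set_proxy('a-proxy', '8080')
    page = agent.get 'http://bobthebuilder.com'

    # The login form is the first form on the page; fill in its two fields.
    form = page.forms.first
    form.user = 'bob'
    form.password = 'password'

    # Submit the form; the agent keeps the session cookies for later requests.
    page = agent.submit form

    # Save the HTML response for inspection.
    File.open("output.html", "w") { |file| file << page.body }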
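The object assigned to a.log is an ordinary Logger from Ruby's standard library, so the usual Logger settings apply to it. Here is a minimal sketch of lowering the verbosity, assuming you only want messages at INFO level and above. A StringIO stands in for STDERR purely so the result can be inspected, and the two log messages are made up for illustration; they are not actual Mechanize output.

```ruby
require 'logger'
require 'stringio'

# StringIO stands in for STDERR so the captured output can be inspected.
sink = StringIO.new
log = Logger.new(sink)
log.level = Logger::INFO # discard DEBUG messages, keep INFO and above

log.debug('request headers would go here') # hypothetical message, filtered out
log.info('GET /login')                     # hypothetical message, kept

captured = sink.string
puts captured
```

A logger configured this way can be handed to the agent in the constructor block exactly like Logger.new(STDERR) above.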
Use the search method to scrape the page content; it accepts an XPath expression. In this example I extract the text of every span element nested inside a table element whose class attribute equals ‘list-of-links’:
    # Print the text node of each matching span.
    puts page.search("//table[@class='list-of-links']//span/text()")
The HTML looks like this (td, tr elements omitted for clarity):
    ...
    <table class="list-of-links">
      ...
      <span>The content</span>
      ...
    </table>
    ...
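To try the XPath expression without logging in anywhere, it can be evaluated against a local copy of the markup. The sketch below uses REXML from the Ruby distribution rather than Mechanize, so it only demonstrates the expression itself, not the search method. The tr/td wrappers, the second span and the second table are assumptions added to make a small well-formed document.

```ruby
require 'rexml/document'

# Sample markup modelled on the excerpt above; the tr/td wrappers and the
# second table are assumptions added for illustration.
html = <<~HTML
  <html><body>
    <table class="list-of-links">
      <tr><td><span>The content</span></td></tr>
      <tr><td><span>More content</span></td></tr>
    </table>
    <table class="other">
      <tr><td><span>Ignored</span></td></tr>
    </table>
  </body></html>
HTML

doc = REXML::Document.new(html)
xpath = "//table[@class='list-of-links']//span/text()"

# Each match is a text node; to_s returns its string content.
texts = REXML::XPath.match(doc, xpath).map(&:to_s)
puts texts
```

Only spans inside the table with the matching class are selected; the span in the other table is skipped. REXML needs well-formed XML, so this approach suits small test fixtures rather than real-world HTML.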
SSH public key authentication - How to generate the key and how to copy it to the remote machine
    ssh-keygen -t dsa
    ssh-copy-id -i ~/.ssh/id_dsa.pub user@server
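OS X doesn’t come equipped with ssh-copy-id, but you can download the script separately. Note that OpenSSH 7.0 and later disable DSA keys by default; on such systems, generate an RSA or Ed25519 key instead (for example with -t ed25519) and copy the matching .pub file.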
Log in to protected resources with curl
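Cookies are stored in and read back from cookies.txt: the --cookie-jar switch writes the cookies received at login, and the --cookie switch sends them with later requests. The POST data is set using the --data switch: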
    curl --cookie cookies.txt --cookie-jar cookies.txt --user-agent Mozilla/4.0 \
         --data "user=xxxxx&password=xxxxx" http://www.com/login -v
    curl --cookie cookies.txt --user-agent Mozilla/4.0 http://www.com/protected/resource
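curl sends the --data string as given, so a user name or password containing characters such as & or = has to be URL-encoded first. One way to build the string, sketched here with Ruby's standard URI module, is shown below. The credentials are hypothetical and chosen only to show the escaping.

```ruby
require 'uri'

# Hypothetical credentials; the password contains characters that would
# break a hand-written --data string.
fields = { 'user' => 'bob', 'password' => 'p&ss word=1' }

# encode_www_form percent-encodes each name and value and joins them with '&'.
data = URI.encode_www_form(fields)
puts data
```

The resulting string can be passed to curl's --data switch unchanged. curl also has a --data-urlencode switch that does the encoding itself, one field at a time.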