Using the WWW::Mechanize RubyGem to scrape login protected pages

Ruby posted 5 months ago by christian

This is an example of how to access a login-protected site with WWW::Mechanize. The login form has two fields named user and password. In other words, the HTML contains the following code:

    <input name="user" .../>
    <input name="password" .../>

Note that this example also shows how to enable WWW::Mechanize logging and how to save the HTML response to a file:

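    require 'rubygems'
    require 'logger'
    require 'mechanize'

    # Log requests and responses to STDERR.
    # Mechanize 1.0 and later name the class Mechanize rather than WWW::Mechanize.
    agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
    # agent.set_proxy('a-proxy', 8080)  # uncomment to go through a proxy
    page = agent.get 'http://bobthebuilder.com'

    # Fill in the first form on the page; user and password are the input names.
    form = page.forms.first
    form.user = 'bob'
    form.password = 'password'

    # Submit the form; the result is the page behind the login.
    page = agent.submit form

    # Save the HTML response.
    File.open("output.html", "w") { |file| file << page.body }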

Use the search method to scrape the page content. In this example I extract the text of every span element nested inside a table element whose class attribute equals ‘list-of-links’:

    puts page.search("//table[@class='list-of-links']//span/text()")

The HTML looks like this (td, tr elements omitted for clarity):

    ...
    <table class="list-of-links">
    ...
    <span>The content</span>
    ...
    </table>
    ...
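
To try the XPath expression above without logging in anywhere, it can be run against a small inline document. This is only a sketch: it uses REXML, which ships with Ruby, as a stand-in for the parser behind Mechanize's search method, and the sample markup, the constant names and the span_texts helper are made up for illustration. REXML needs well-formed markup, so real-world HTML may not parse this way.

```ruby
require 'rexml/document'

# Hypothetical sample page with the omitted tr and td elements filled in.
SAMPLE_HTML = <<~HTML
  <html><body>
    <table class="list-of-links">
      <tr><td><span>The content</span></td></tr>
      <tr><td><span>More content</span></td></tr>
    </table>
    <table class="other">
      <tr><td><span>Ignored</span></td></tr>
    </table>
  </body></html>
HTML

# Same expression as in the Mechanize snippet.
LINK_TEXT_XPATH = "//table[@class='list-of-links']//span/text()"

# Return the text of every span inside the matching table, in document order.
def span_texts(html)
  document = REXML::Document.new(html)
  REXML::XPath.match(document, LINK_TEXT_XPATH).map(&:value)
end

puts span_texts(SAMPLE_HTML)
```

Only spans inside the table with the matching class are returned; the span in the second table is skipped.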

Tagged www, mechanize, scraping, scrape, login, ruby

SSH public key encryption - How to generate the key and how to copy it to the remote machine

Shell Script (Bash) posted 7 months ago by christian

    ssh-keygen -t dsa
    ssh-copy-id -i ~/.ssh/id_dsa.pub user@server

OS X doesn’t come equipped with ssh-copy-id but you can download the script from here.
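
For readers without ssh-copy-id, it may help to see what the script amounts to: the public key ends up as one line in ~/.ssh/authorized_keys on the remote account. The sketch below shows that step in Ruby, assuming it is run on the server with the key text already at hand. The authorize_key helper is hypothetical, not part of any SSH tooling.

```ruby
require 'fileutils'

# Append a public key to an authorized_keys file unless it is already listed.
# Returns true when the key was added, false when it was already present.
def authorize_key(public_key, authorized_keys_path)
  key = public_key.strip
  # sshd expects ~/.ssh to be private to the user.
  FileUtils.mkdir_p(File.dirname(authorized_keys_path), mode: 0o700)

  content = File.exist?(authorized_keys_path) ? File.read(authorized_keys_path) : ''
  return false if content.each_line.map(&:strip).include?(key)

  File.open(authorized_keys_path, 'a', 0o600) do |file|
    # Start a new line first if the existing file lacks a trailing newline.
    file.puts unless content.empty? || content.end_with?("\n")
    file.puts(key)
  end
  true
end
```

A call might look like authorize_key(File.read('id_dsa.pub'), File.expand_path('~/.ssh/authorized_keys')), where id_dsa.pub is a copy of the key generated above.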

Tagged ssh, public key, login, generate

Login to protected resources with curl

Shell Script (Bash) posted about 1 year ago by christian

Cookies are stored in and read back from cookies.txt. POST data is set with the --data switch:

    curl --cookie cookies.txt --cookie-jar cookies.txt --user-agent Mozilla/4.0 --data "user=xxxxx&password=xxxxx" http://www.com/login -v
    curl --cookie cookies.txt --user-agent Mozilla/4.0 http://www.com/protected/resource
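
The same two-step flow can be sketched in Ruby with Net::HTTP from the standard library. This is a rough equivalent rather than a replacement for curl's cookie jar: the helper names are made up for illustration, the cookies are kept in memory instead of cookies.txt, and cookie attributes such as Path, Domain and expiry are ignored.

```ruby
require 'net/http'
require 'uri'

# Build the login POST; the field names mirror the curl --data string.
def build_login_request(login_url, user, password)
  request = Net::HTTP::Post.new(URI(login_url))
  request['User-Agent'] = 'Mozilla/4.0'
  request.set_form_data('user' => user, 'password' => password)
  request
end

# Reduce Set-Cookie header values to a Cookie header ("name=value; name2=value2").
def cookie_header(set_cookie_values)
  Array(set_cookie_values).map { |value| value.split(';').first.strip }.join('; ')
end

# Send one request and return the response.
def send_request(uri, request)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# Log in, then fetch a protected URL with the cookies from the login response.
def fetch_protected(login_url, resource_url, user, password)
  login_response = send_request(URI(login_url), build_login_request(login_url, user, password))

  resource_uri = URI(resource_url)
  request = Net::HTTP::Get.new(resource_uri)
  request['User-Agent'] = 'Mozilla/4.0'
  request['Cookie'] = cookie_header(login_response.get_fields('Set-Cookie'))
  send_request(resource_uri, request).body
end
```

Calling fetch_protected('http://www.com/login', 'http://www.com/protected/resource', 'xxxxx', 'xxxxx') would correspond to the two curl commands, minus the verbose output from -v.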

Tagged curl, login, cookies