scrape snippets

Using the WWW::Mechanize RubyGem to scrape login protected pages

Tagged www, mechanize, scraping, scrape, login, ruby  Languages ruby

This is an example of how to access a login-protected site with WWW::Mechanize. In this example, the login form has two fields named user and password. In other words, the HTML contains the following code:

<input name="user" .../>
<input name="password" .../>

Note that this example also shows how to enable WWW::Mechanize logging and how to capture the HTML response:

require 'rubygems'
require 'logger'
require 'mechanize'

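# Log Mechanize activity to standard error.
# Mechanize 2.x dropped the WWW namespace; use Mechanize.new there.
agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }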
#agent.set_proxy('a-proxy', '8080')
page = agent.get ''

form = page.forms.first
form.user = 'bob'
form.password = 'password'

page = agent.submit form

# Capture the HTML response in a file
output = File.open("output.html", "w") { |file| file << page.body }
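
For reference, submitting the form amounts to sending the two fields as a URL-encoded request body. The following is a minimal sketch using only Ruby's standard library. It assumes the form uses the default application/x-www-form-urlencoded encoding, which the HTML above does not show, and the variable names are illustrative only:

```ruby
require 'uri'

# Field names match the <input> names in the login form;
# the values are the same placeholder credentials used above.
fields = { 'user' => 'bob', 'password' => 'password' }

# Build an application/x-www-form-urlencoded body from the fields
# (assumption: the form does not use multipart encoding)
body = URI.encode_www_form(fields)

puts body
```

Printing the body this way can help when comparing against the requests that appear in the Mechanize log.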

Use the search method to scrape the page content. In this example I extract all text contained by span elements, which in turn are contained by a table element having a class attribute equal to 'list-of-links':

puts page.search("//table[@class='list-of-links']//span/text()")

The HTML looks like this (td, tr elements omitted for clarity):

<table class="list-of-links">
<span>The content</span>
</table>
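
To try the XPath expression without a live site, it can be run against an inline copy of the markup. The sketch below uses REXML as a stand-in for the parser behind Mechanize, so it only illustrates the expression itself. The tr and td elements and the second row are made-up sample data:

```ruby
require 'rexml/document'

# Hypothetical sample markup, kept well-formed so REXML can parse it
html = <<~HTML
  <table class="list-of-links">
    <tr><td><span>The content</span></td></tr>
    <tr><td><span>More content</span></td></tr>
  </table>
HTML

doc = REXML::Document.new(html)

# Same XPath as in the snippet: text nodes of every span inside the table
nodes = REXML::XPath.match(doc, "//table[@class='list-of-links']//span/text()")
texts = nodes.map(&:to_s)

puts texts
```

The leading // on span matters here: it matches spans at any depth below the table, so the omitted tr and td elements do not need to be spelled out in the expression.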

How to scrape an Amazon Listmania list with Hpricot and Ruby

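Tagged amazon, hpricot, scrape  Languages ruby

require 'open-uri'
require 'hpricot'

# URL of the Listmania list
html = open('')

page = Hpricot(html)

# Each matching input holds a book's ASIN in its value attribute
xpath = "td[@class='listItem']//input[@name='asin.1']"
page.search(xpath).each do |book|
  puts book['value']
end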

How to scrape web pages with PhantomJS and jQuery

Tagged phantomjs, scrape, jquery  Languages javascript

This is an example of how to scrape the web using PhantomJS and jQuery:

var page = new WebPage(),
    url = 'http://localhost/a-search-form',
    stepIndex = 0;

/**
 * From PhantomJS documentation:
 * This callback is invoked when there is a JavaScript console. The callback may accept up to three arguments: 
 * the string for the message, the line number, and the source identifier.
 */
page.onConsoleMessage = function (msg, line, source) {
    console.log('console> ' + msg);
};

/**
 * From PhantomJS documentation:
 * This callback is invoked when there is a JavaScript alert. The only argument passed to the callback is the string for the message.
 */
page.onAlert = function (msg) {
    console.log('alert!!> ' + msg);
};

// Callback is executed each time a page is loaded...
page.open(url, function (status) {
  if (status === 'success') {
    // State is initially empty. State is persisted between page loads and can be used for identifying which page we're on.
    console.log('Step "' + stepIndex + '"');

    // Inject jQuery for scraping (you need to save jquery-1.6.1.min.js in the same folder as this file)
    page.injectJs('jquery-1.6.1.min.js');

    // Our "event loop"
    if (!phantom.state) {
      initialize();
    } else {
      phantom.state();
    }

    // Save screenshot for debugging purposes
    page.render("step" + stepIndex++ + ".png");
  }
});

// Step 1
function initialize() {
  page.evaluate(function() {
    $('form#search input.query').val('Jebus saves');
    // Assumed step: submit the form so the results page loads
    $('form#search').submit();
  });
  // Phantom state doesn't change between page reloads
  // We use the state to store the search result handler, i.e. the next step
  phantom.state = parseResults;
}

// Step 2
function parseResults() {
  page.evaluate(function() {
    $('#search-result a').each(function(index, link) {
      // Assumption: print the target of each result link
      console.log($(link).attr('href'));
    });
    console.log('Parsed results');
  });
  // If there was a 3rd step we could point to another function
  // but we would have to reload the page for the callback to be called again
  phantom.exit();
}