hpricot snippets

Scraping Google search results with Scrubyt and Ruby

Tagged web, scraping, google, scrubyt, ruby, gotcha, hpricot, todelete, obsolete  Languages ruby

Note that these instructions don't work with the latest Scrubyt version...

Scrubyt is a Ruby library that allows you to easily scrape the contents of any site.

First install Scrubyt:

$ sudo gem install mechanize hpricot parsetree ruby2ruby scrubyt

You also need to install RubyInline version 3.6.3:

sudo gem install -v 3.6.3 RubyInline

If you install the wrong RubyInline version or have multiple versions installed, you'll get the following error:

/usr/lib/ruby/1.8/rubygems.rb:207:in `activate': can't activate RubyInline (= 3.6.3), already activated RubyInline-3.6.6 (Gem::Exception)
       from /usr/lib/ruby/1.8/rubygems.rb:225:in `activate'
       from /usr/lib/ruby/1.8/rubygems.rb:224:in `each'
       from /usr/lib/ruby/1.8/rubygems.rb:224:in `activate'
       from /usr/lib/ruby/1.8/rubygems/custom_require.rb:32:in `require'
       from t2:2

To fix it, uninstall the newer version and keep only version 3.6.3:

sudo gem uninstall RubyInline

Select RubyGem to uninstall:
 1. RubyInline-3.6.3
 2. RubyInline-3.6.6
 3. All versions
> 2

Scraping Google search results

Then run this to scrape the first two pages of Google search results for "ruby":

require 'rubygems'
require 'scrubyt'

# See http://scrubyt.org/example-specification-from-the-page-known-issues-and-pitfalls/

# Create a learning extractor
data = Scrubyt::Extractor.define do
  fetch('http://www.google.com/')
  fill_textfield 'q', 'ruby'
  submit
  
  # Teach Scrubyt what we want to retrieve.
  # In this case we want Scrubyt to find all search results,
  # and "Ruby Programming Language" happens to be the first
  # link in the result list. Change "Ruby Programming Language"
  # to whatever you want Scrubyt to find.
  link do
    name  "Ruby Programming Language"
    url   "href", :type => :attribute
  end
  
  # Click next until we're on the second page.
  next_page "Next", :limit => 2
end

# Print out what Scrubyt found
puts data.to_xml 

puts "Your production scraper has been created: data_extractor_export.rb."

# Export the production version of the scraper
data.export(__FILE__)

Learning extractor vs. production extractor

Note that this example uses the Learning Extractor functionality of Scrubyt.

The production extractor is generated with the last line:

data.export(__FILE__)

If you open the production extractor in an editor you'll see that it uses XPath queries to extract the content:

link("/html/body/div/div/div/h2", { :generalize => true }) do
    name("/a[1]")
    url("href", { :type => :attribute })
  end

Finding the correct XPath

The learning mode is pretty good at finding the XPath of HTML elements, but if you have difficulties getting Scrubyt to extract exactly what you want, simply install Firebug and use the Inspect feature to select the item you want to extract the value from. Then right-click on it in the Firebug window and choose Copy XPath.

Note that there's a gotcha when copying the XPath of an element with Firebug. Firebug uses Firefox's internal, normalized DOM model, which might not match the real-world HTML structure. For example, Firefox usually adds a tbody tag to tables, and it should be removed from the XPath if it isn't in the actual HTML.
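
For example, if Firebug gives you an XPath containing /tbody but the raw HTML has no tbody element, strip it before handing the path to Hpricot or Scrubyt. A quick way to do that (the XPath below is just an example):

firebug_xpath = "/html/body/table[2]/tbody/tr[2]/td[3]"
puts firebug_xpath.gsub('/tbody', '')  # => "/html/body/table[2]/tr[2]/td[3]"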

Another option that I haven't tried myself is to use the XPather extension.

Using hpricot to find the XPath

If you're really having problems finding the right XPath for an element, you can also use Hpricot to find it. In this example the code prints out the XPath of every table cell containing the text 51,992:

require 'rexml/document'
require 'hpricot'
require 'open-uri'

url = "http://xyz"

page = Hpricot(open(url,
    'User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
        'Referer'    => 'http://xyz'
        ))

page.search( "//td:contains('51,992')" ).each do |row|
  puts row.xpath()
end

The output from the above snippet looks something like this:

/html/body/table[2]/tr[2]/td[3]
/html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]
/html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]/table[1]/tr[2]/td[2]

Note that I sometimes find Hpricot easier to use than Scrubyt, so use whichever works best for you.

Miscellaneous problems

The following problem can be solved by following the instructions found here:

Your production scraper has been created: data_extractor_export.rb.
/var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `extend': wrong argument type Class (expected Module) (TypeError)
       from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `to_sexp'
       from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:93:in `parse_tree_for_method'
       from /var/lib/gems/1.8/gems/ruby2ruby-1.1.6/lib/ruby2ruby.rb:1063:in `to_sexp'

Scraping Yahoo! Finance with Ruby and Hpricot

Tagged yahoo, finance, ruby, hpricot  Languages ruby

This code extracts the numbers from the Fund operations table on the BLV fund's Profile page at Yahoo! Finance.

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'pp'

page = Hpricot(open('http://finance.yahoo.com/q/pr?s=BLV'))

fund_operations = []
page.search( "//table[@class='yfnc_datamodoutline1']" ).each do |row|
  row.search( "//td[@class='yfnc_datamoddata1']").each do |data|
    fund_operations << data.inner_html
  end
end

pp fund_operations

The output from this script is:

["N/A", "N/A", "55%", "72", "85.05M", "1.71B"]

Note that you could also use Scrubyt for this; see Scraping Google search results with Scrubyt and Ruby above for an introduction to Scrubyt.
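
Untested, but a learning-extractor version for the same Yahoo! Finance page could look roughly like this (the fund_operation pattern name and the "1.71B" example value are mine; the example value only teaches Scrubyt which cells to grab and may need updating to whatever the page currently shows):

require 'rubygems'
require 'scrubyt'

fund_data = Scrubyt::Extractor.define do
  fetch 'http://finance.yahoo.com/q/pr?s=BLV'

  # Teach Scrubyt what to extract by giving it one example value
  # from the Fund operations table
  fund_operation do
    value "1.71B"
  end
end

puts fund_data.to_xml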

Hpricot's inner_text doesn't handle HTML entities correctly

Tagged hpricot, inner_text, problem, bug  Languages ruby

Hpricot's inner_text method is fubar: it doesn't handle HTML entities correctly, and you'll see question marks in the output instead of the decoded characters. To fix this, replace calls to Hpricot's inner_text with a call to the following method (or monkey-patch Hpricot):

require 'rubygems'
require 'htmlentities'

def inner_text(node)
  text = node.innerHTML.gsub(%r{<.*?>}, "").strip
  HTMLEntities.new.decode(text)
end

Remember to install the htmlentities gem:

sudo gem install htmlentities
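
If you prefer the monkey-patch route, something along these lines should work (an untested sketch; it adds a new inner_text_decoded method to Hpricot::Elem instead of overriding inner_text, so existing code keeps working):

require 'rubygems'
require 'hpricot'
require 'htmlentities'

class Hpricot::Elem
  # Like inner_text, but decodes HTML entities such as &eacute; and &amp;
  def inner_text_decoded
    HTMLEntities.new.decode(inner_html.gsub(%r{<.*?>}, "").strip)
  end
end

doc = Hpricot("<p>Caf&eacute; &amp; bar</p>")
puts doc.at('p').inner_text_decoded  # => "Café & bar"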

How to fix issues with missing gem specifications

Tagged hpricot, gem, unpack, rails  Languages bash

I was getting this error after unpacking hpricot with gem unpack hpricot. I also tried rake gems:unpack hpricot but it did nothing...

config.gem: Unpacked gem hpricot-0.8.1 in vendor/gems has no specification file. Run 'rake gems:refresh_specs' to fix this.

The rake gems:refresh_specs command doesn't work and appears to have been a temporary workaround, so to fix the error I did this:

cd vendor/gems/hpricot-0.8.1
gem specification hpricot > .specification

I had this issue with Rails 2.3.4.
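
If several unpacked gems have the same problem, a small Ruby script can regenerate the .specification files for all of them in one go (a rough sketch; it assumes the directories under vendor/gems follow the usual name-version naming):

Dir.glob('vendor/gems/*').each do |dir|
  next unless File.directory?(dir)
  # hpricot-0.8.1 -> hpricot
  name = File.basename(dir).sub(/-[\d.]+$/, '')
  spec = `gem specification #{name}`
  File.open(File.join(dir, '.specification'), 'w') { |f| f.write(spec) }
end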

How to scrape an Amazon Listmania list with Hpricot and Ruby

Tagged amazon, hpricot, scrape  Languages ruby

require 'rubygems'
require 'open-uri'
require 'hpricot'

html = open('http://www.amazon.com/Nick-Hornby-and-Company/lm/1X1GGDBXARHZ6/ref=cm_lm_toplist_fullview_1')

page = Hpricot(html)

xpath = "td[@class='listItem']//input[@name='asin.1']"

page.search(xpath).each do |book|
  puts book['value']
end
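
The values printed above are the books' ASINs. If you want product links instead, something like this should work (the /dp/ URL format is an assumption on my part):

page.search(xpath).each do |book|
  puts "http://www.amazon.com/dp/#{book['value']}"
end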

How to fetch delicious data with Hpricot and OpenURI

Tagged delicious, hpricot, ruby, rss, feed  Languages ruby

The code:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'ostruct'

class Delicious
  class << self
    def tag(username, name, count = 15)
      links = []
      url = "http://feeds.delicious.com/v2/rss/#{username}/#{name}?count=#{count}"
      feed = Hpricot(open(url))

      feed.search("item").each do |i|
        item = OpenStruct.new
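        # Hpricot parses the RSS as HTML and treats <link> as an empty tag,
        # so the URL ends up in the text node that follows it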
        item.link = i.at('link').next.to_s
        item.title = i.at('title').innerHTML
        item.description  = i.at('description').innerHTML rescue nil

        links << item
      end

      links
    end
  end
end

Usage:

# Return last 15 items tagged with business and news from jebus's account:
Delicious.tag 'jebus', 'business+news', 15

Returns an array of items.
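
For example, to print the title and URL of each bookmark:

Delicious.tag('jebus', 'business+news', 15).each do |item|
  puts "#{item.title}: #{item.link}"
end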