gotcha snippets

Scraping Google search results with Scrubyt and Ruby

Tagged web, scraping, google, scrubyt, ruby, gotcha, hpricot, todelete, obsolete  Languages ruby

Note that these instructions don't work with the latest Scrubyt version...

Scrubyt is a Ruby library that allows you to easily scrape the contents of any site.

First install Scrubyt:

$ sudo gem install mechanize hpricot parsetree ruby2ruby scrubyt

You also need to install RubyInline version 3.6.3:

sudo gem install -v 3.6.3 RubyInline

If you install the wrong RubyInline version or have multiple versions installed, you'll get the following error:

/usr/lib/ruby/1.8/rubygems.rb:207:in `activate': can't activate RubyInline (= 3.6.3), already activated RubyInline-3.6.6] (Gem::Exception)
       from /usr/lib/ruby/1.8/rubygems.rb:225:in `activate'
       from /usr/lib/ruby/1.8/rubygems.rb:224:in `each'
       from /usr/lib/ruby/1.8/rubygems.rb:224:in `activate'
       from /usr/lib/ruby/1.8/rubygems/custom_require.rb:32:in `require'
       from t2:2

To fix it, uninstall the newer version and keep only version 3.6.3:

sudo gem uninstall RubyInline

Select RubyGem to uninstall:
 1. RubyInline-3.6.3
 2. RubyInline-3.6.6
 3. All versions
> 2
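
To check which versions are installed before running a script, something like the following sketch may help. It assumes a current RubyGems, where Gem::Specification.find_all_by_name is available (the 1.8-era RubyGems in the trace above predates it), and installed_versions is a hypothetical helper name, not part of Scrubyt or RubyInline:

```ruby
require 'rubygems'

# Hypothetical helper: returns the installed versions of a gem as strings,
# oldest first. An empty array means the gem is not installed.
def installed_versions(gem_name)
  Gem::Specification.find_all_by_name(gem_name).map(&:version).sort.map(&:to_s)
end

versions = installed_versions('RubyInline')
if versions.size > 1
  puts "Multiple RubyInline versions installed: #{versions.join(', ')}"
else
  puts "RubyInline versions installed: #{versions.inspect}"
end
```

If more than one version is listed, uninstall the extra ones as shown above.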

Scraping Google search results

Then run this script to scrape the first two pages of the Google results for ruby:

require 'rubygems'
require 'scrubyt'

# See http://scrubyt.org/example-specification-from-the-page-known-issues-and-pitfalls/

# Create a learning extractor
data = Scrubyt::Extractor.define do
  fetch('http://www.google.com/')
  fill_textfield 'q', 'ruby'
  submit
  
  # Teach Scrubyt what we want to retrieve.
  # In this case we want Scrubyt to find all search results,
  # and "Ruby Programming Language" happens to be the first
  # link in the result list. Change "Ruby Programming Language"
  # to whatever you want Scrubyt to find.
  link do
    name  "Ruby Programming Language"
    url   "href", :type => :attribute
  end
  
  # Click next until we're on the second page.
  next_page "Next", :limit => 2
end

# Print out what Scrubyt found
puts data.to_xml

puts "Your production scraper has been created: data_extractor_export.rb."

# Export the production version of the scraper
data.export(__FILE__)

Learning Extractor vs Production extractor

Note that this example uses the Learning Extractor functionality of Scrubyt.

The production extractor is generated with the last line:

data.export(__FILE__)

If you open the production extractor in an editor you'll see that it uses XPath queries to extract the content:

link("/html/body/div/div/div/h2", { :generalize => true }) do
  name("/a[1]")
  url("href", { :type => :attribute })
end

Finding the correct XPath

The learning mode is pretty good at finding the XPath of HTML elements, but if you have difficulties getting Scrubyt to extract exactly what you want, simply install Firebug and use the Inspect feature to select the item you want to extract the value from. Then right-click on it in the Firebug window and choose Copy XPath.

Note that there's a gotcha when copying the XPath of an element with Firebug. Firebug uses Firefox's internal, normalized DOM model, which might not match the real-world HTML structure. For example, the tbody tag is usually added by Firefox/Firebug, and should be removed from the XPath if it isn't in the HTML.
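
Removing the tbody steps by hand gets tedious for long paths. A minimal sketch of doing it in Ruby follows; strip_tbody is a hypothetical helper name, and it assumes you have already checked that the page source really has no tbody tags:

```ruby
# Hypothetical helper: removes every /tbody or /tbody[n] step from an
# XPath copied out of Firebug. Only use it when the raw HTML has no tbody.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

puts strip_tbody('/html/body/table[2]/tbody/tr[2]/td[3]')
```

The lookahead keeps the pattern from touching element names that merely start with tbody.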

Another option that I haven't tried myself is to use the XPather extension.

Using hpricot to find the XPath

If you're really having problems finding the right XPath of an element, you can also use Hpricot to find it. In this example the code prints out the XPath of every table cell containing the text 51,992:

require 'rexml/document'
require 'hpricot'
require 'open-uri'

url = "http://xyz"

page = Hpricot(open(url,
  'User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
  'Referer'    => 'http://xyz'
))

page.search( "//td:contains('51,992')" ).each do |row|
  puts row.xpath()
end

The output from the above snippet looks something like this:

/html/body/table[2]/tr[2]/td[3]
/html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]
/html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]/table[1]/tr[2]/td[2]

Note that I sometimes find Hpricot easier to use than Scrubyt, so use whichever works best for you.

Miscellaneous problems

The following problem can be solved by following the instructions found here:

Your production scraper has been created: data_extractor_export.rb.
/var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `extend': wrong argument type Class (expected Module) (TypeError)
       from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `to_sexp'
       from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:93:in `parse_tree_for_method'
       from /var/lib/gems/1.8/gems/ruby2ruby-1.1.6/lib/ruby2ruby.rb:1063:in `to_sexp'

PHP file upload gotchas

Tagged gotcha, php, file, upload  Languages php

PHP file upload works in mysterious ways:

http://us3.php.net/manual/en/ini.core.php#ini.post-max-size
http://fi.php.net/manual/en/features.file-upload.php#73762
http://de3.php.net/manual/en/features.file-upload.errors.php

How to fix "Mysql::Error: Duplicate entry '2147483647' for key 3: INSERT INTO xxx"

Tagged mysql, numeric, error, gotcha, migration  Languages ruby

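2147483647 (2^31 - 1) is the maximum value of a signed INT column in MySQL. In non-strict SQL mode MySQL silently clamps any larger value to this maximum, so the error probably means your code tried to store a number that exceeds the limit, and a second such insert then collides with the first on a unique key.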

Rails chooses the MySQL integer type from the column's :limit option (a size in bytes), so be sure to specify the correct limit when creating the column in a migration:

# from activerecord-2.1.1/lib/active_record/connection_adapters/mysql_adapter.rb
        case limit
        when 1; 'tinyint'
        when 2; 'smallint'
        when 3; 'mediumint'
        when nil, 4, 11; 'int(11)'  # compatibility with MySQL default
        when 5..8; 'bigint'
        else raise(ActiveRecordError, "No integer type has byte size #{limit}")
        end

This Rails migration code would create a big integer column:

t.integer :product_id, :null => false, :limit => 8

See the section on Numeric Types in the MySQL documentation for more information.
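
To see which :limit a column needs, it can help to print the largest signed value for each byte size. This is a plain-Ruby sketch, not Rails code; max_signed_value is a hypothetical helper name, and the type names mirror the adapter code quoted above:

```ruby
# Largest value a signed integer column of the given byte size can hold.
def max_signed_value(bytes)
  2**(8 * bytes - 1) - 1
end

# Byte sizes as accepted by the :limit option, with the matching MySQL type.
{ 1 => 'tinyint', 2 => 'smallint', 3 => 'mediumint', 4 => 'int', 8 => 'bigint' }.each do |bytes, type|
  puts format('%-9s :limit => %d  max %d', type, bytes, max_signed_value(bytes))
end
```

The line for int shows the 2147483647 from the error message, which is why a :limit of 8 is needed for larger ids.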

Fixing Phusion Passenger "Error during failsafe response: closed stream"

Tagged passenger, gotcha  Languages bash

This error most probably means that Passenger doesn't have read access to all files:

Error during failsafe response: closed stream
[Thu Aug 13 01:40:05 2009] [error] [client 88.115.162.70] Premature end of script headers: 
[ pid=12581 file=Hooks.cpp:516 time=2009-08-13 01:40:05.753 ]:
  Backend process 18230 did not return a valid HTTP response. It returned: [Status]
*** Exception NoMethodError in application (undefined method `[]=' for nil:NilClass) (process 18230):

To fix it run:

chown -R xxx:www-data /var/www/xxx
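
To find out which files are affected, a small script along these lines can list them. It is only a sketch: unreadable_paths is a hypothetical helper name, /var/www/xxx is the placeholder path from the command above, and the script has to be run as the user the application runs under for the result to mean anything:

```ruby
require 'find'

# Hypothetical helper: lists every path under root that the current user
# cannot read. Dangling symlinks are reported as well.
def unreadable_paths(root)
  Find.find(root).reject { |path| File.readable?(path) }
end

# Placeholder path; replace it with the real application directory.
app_root = '/var/www/xxx'
if File.directory?(app_root)
  unreadable_paths(app_root).each { |path| puts path }
else
  warn "No such directory: #{app_root}"
end
```

An empty result suggests the permissions are fine and the error has another cause.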

How to fix Shoulda's “Can't find first xxx” problem

Tagged shoulda, gotcha, should_validate_uniqueness_of  Languages ruby

In the following example, Shoulda's should_validate_uniqueness_of might throw a “Can't find first” error:

class PostTest < ActiveSupport::TestCase
  should_validate_uniqueness_of :title
end

To fix it add a subject to the test:

class PostTest < ActiveSupport::TestCase
  subject { Factory(:post) }
  should_validate_uniqueness_of :title
end

How to Test Authentication With Devise+Capybara+Minitest

Tagged devise, capybara, authentication, warden, transaction, gotcha, minitest  Languages ruby

Testing authentication functionality with Capybara and Devise? See the following checklist:

* Use shared connections or disable transactional fixtures.
* Set Capybara.default_host to match config.session_store.domain or you'll get "401 Unauthorized"
* Name of test should end with "integration", e.g. describe "Dashboard Business integration" do
* Add the following to your integration tests:

include Warden::Test::Helpers
Warden.test_mode!

after do
  Warden.test_reset!
end

Full example of integration test with Devise, Capybara, and minitest:

class IntegrationSpec < MiniTest::Spec
  include Rails.application.routes.url_helpers
  include Capybara::DSL
  include Warden::Test::Helpers
  Warden.test_mode!

  before do
    @routes = Rails.application.routes
  end

  after do
    Warden.test_reset!
  end

  def sign_in(user)
    login_as(user, scope: :user)
  end

  def sign_out
    logout(:user)
  end

  def default_url_options
    Rails.configuration.action_mailer.default_url_options
  end
end

MiniTest::Spec.register_spec_type( /integration$/, IntegrationSpec )