
List all links in a file with Ruby and regular expressions

Ruby posted about 1 year ago by christian

This snippet lists all text links:

    data = File.read('the_link_collection.txt')

    # Capture each link's URL (group 1) and its link text (group 2)
    links = data.scan(/href="([^"]*)[^>]*>([^<]*)<\/a>/im)

    links.each do |link|
      puts "#{link[1].chomp} = #{link[0]}"
    end
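
To try the pattern without a file on disk, the following minimal sketch runs the same regular expression against an inline string. The sample HTML and the helper name extract_links are assumptions made for illustration; they are not part of the original snippet.

```ruby
# Hypothetical sample input; stands in for the contents of the_link_collection.txt
html = <<~HTML
  <a href="http://example.com/">Example</a>
  <A HREF="http://example.org/docs" class="ext">Docs
  </A>
HTML

# Returns [link text, URL] pairs using the same pattern as the snippet above
def extract_links(text)
  pattern = /href="([^"]*)[^>]*>([^<]*)<\/a>/im
  text.scan(pattern).map { |url, label| [label.strip, url] }
end

links = extract_links(html)
links.each { |label, url| puts "#{label} = #{url}" }
```

A regular expression like this only handles double-quoted href attributes and links whose body is plain text; for arbitrary HTML a real parser is the safer choice.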

Tagged ruby, regex

Scraping Google search results with Scrubyt and Ruby

Ruby posted about 1 year ago by christian

Note that these instructions don’t work with the latest Scrubyt version…

Scrubyt is a Ruby library that allows you to easily scrape the contents of any site.

First install Scrubyt:

    $ sudo gem install mechanize hpricot parsetree ruby2ruby scrubyt

You also need to install RubyInline version 3.6.3:

    $ sudo gem install -v 3.6.3 RubyInline

If you install the wrong RubyInline version or have multiple versions installed, you’ll get the following error:

    /usr/lib/ruby/1.8/rubygems.rb:207:in `activate': can't activate RubyInline (= 3.6.3), already activated RubyInline-3.6.6] (Gem::Exception)
           from /usr/lib/ruby/1.8/rubygems.rb:225:in `activate'
           from /usr/lib/ruby/1.8/rubygems.rb:224:in `each'
           from /usr/lib/ruby/1.8/rubygems.rb:224:in `activate'
           from /usr/lib/ruby/1.8/rubygems/custom_require.rb:32:in `require'
           from t2:2

To fix it, uninstall the newer version and keep only version 3.6.3:

    $ sudo gem uninstall RubyInline

    Select RubyGem to uninstall:
     1. RubyInline-3.6.3
     2. RubyInline-3.6.6
     3. All versions
    > 2
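
The conflict can be illustrated with RubyGems' own version classes. This is a minimal sketch, assuming (as the error message suggests) that the requirement being activated is exactly "= 3.6.3"; the list of installed versions is hard-coded for illustration rather than read from the system.

```ruby
require 'rubygems'

# Assumed requirement, taken from the "(= 3.6.3)" in the error message
requirement = Gem::Requirement.new('= 3.6.3')

# Hard-coded stand-ins for the versions listed by "gem uninstall"
installed = %w[3.6.3 3.6.6]

# Check each installed version against the exact-version requirement
results = installed.map do |version|
  [version, requirement.satisfied_by?(Gem::Version.new(version))]
end

results.each do |version, ok|
  puts "RubyInline #{version}: #{ok ? 'satisfies' : 'conflicts with'} #{requirement}"
end
```

Only 3.6.3 satisfies the requirement, which is why the newer version has to go.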

Scraping Google search results

Then run this to scrape the first two pages of the Google results for ruby:

    require 'rubygems'
    require 'scrubyt'

    # See http://scrubyt.org/example-specification-from-the-page-known-issues-and-pitfalls/

    # Create a learning extractor
    data = Scrubyt::Extractor.define do
      fetch('http://www.google.com/')
      fill_textfield 'q', 'ruby'
      submit

      # Teach Scrubyt what we want to retrieve.
      # In this case we want Scrubyt to find all search results,
      # and "Ruby Programming Language" happens to be the first
      # link in the result list. Change "Ruby Programming Language"
      # to whatever you want Scrubyt to find.
      link do
        name  "Ruby Programming Language"
        url   "href", :type => :attribute
      end

      # Click next until we're on the second page.
      next_page "Next", :limit => 2
    end

    # Print out what Scrubyt found
    puts data.to_xml

    puts "Your production scraper has been created: data_extractor_export.rb."

    # Export the production version of the scraper
    data.export(__FILE__)

Learning Extractor vs Production extractor

Note that this example uses the Learning Extractor functionality of Scrubyt.

The production extractor is generated with the last line:

    data.export(__FILE__)

If you open the production extractor in an editor you’ll see that it uses XPath queries to extract the content:

    link("/html/body/div/div/div/h2", { :generalize => true }) do
      name("/a[1]")
      url("href", { :type => :attribute })
    end

Finding the correct XPath

The learning mode is pretty good at finding the XPath of HTML elements, but if you have difficulties getting Scrubyt to extract exactly what you want, simply install Firebug and use the Inspect feature to select the item you want to extract the value from. Then right-click on it in the Firebug window and choose Copy XPath.

Note that there’s a gotcha when copying the XPath of an element with Firebug. Firebug uses Firefox’s internal, normalized DOM model, which might not match the real-world HTML structure. For example, the tbody tag is usually added by Firefox/Firebug and should be removed from the XPath if it isn’t in the HTML.
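
Removing those steps by hand gets tedious for deep tables, so here is a minimal sketch of a helper that does it. The method name strip_tbody and the sample path are hypothetical, and the helper removes every tbody step, so it only fits pages whose raw HTML has no tbody tags at all.

```ruby
# Drop every tbody step (with or without an index) from an XPath string.
# The lookahead makes sure only whole steps named exactly "tbody" are removed.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(?:\[\d+\])?(?=/|\z)}, '')
end

# Hypothetical path as it might be copied from Firebug
copied = '/html/body/table[2]/tbody/tr[2]/td[3]'
puts strip_tbody(copied)
```

If the page does contain some real tbody tags, remove only the steps that Firefox added.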

Another option that I haven’t tried myself is to use the XPather extension.

Using Hpricot to find the XPath

If you’re really having problems finding the right XPath of an element, you can also use Hpricot to find it. In this example the code prints out the XPath of every table cell containing the text 51,992:

    require 'rexml/document'
    require 'hpricot'
    require 'open-uri'

    url = "http://xyz"

    # Fetch the page with browser-like request headers
    page = Hpricot(open(url,
      'User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
      'Referer'    => 'http://xyz'
    ))

    # Print the XPath of every table cell that contains the text
    page.search("//td:contains('51,992')").each do |cell|
      puts cell.xpath
    end

The output from the above snippet looks something like this:

    /html/body/table[2]/tr[2]/td[3]
    /html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]
    /html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]/table[1]/tr[2]/td[2]

I sometimes find Hpricot easier to use than Scrubyt, so use whatever works best for you.

Miscellaneous problems

The following problem can be solved by following the instructions found here:

    Your production scraper has been created: data_extractor_export.rb.
    /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `extend': wrong argument type Class (expected Module) (TypeError)
           from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `to_sexp'
           from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:93:in `parse_tree_for_method'
           from /var/lib/gems/1.8/gems/ruby2ruby-1.1.6/lib/ruby2ruby.rb:1063:in `to_sexp'

Tagged web, scraping, google, scrubyt, ruby, gotcha, hpricot, todelete, obsolete

How to run multiple Rails applications from the same directory

Ruby posted about 1 year ago by christian

Set this in environment.rb:

    ActionController::AbstractRequest.relative_url_root = "/appname/"
    ActionController::CgiRequest.relative_url_root = "/appname/"

Tagged rails, nginx, apache

How to parse RI generated documentation using RDoc and Ruby

Ruby posted about 1 year ago by christian

RI stores the generated documentation as YAML files. This code uses RDoc to parse the YAML files:

    require 'yaml'
    require 'find'
    require "rdoc/ri/ri_driver"

    dirs = RI::Paths::PATH
    dirs.each do |dir|
      Find.find(dir) do |fn|
        next unless File.file?(fn)
        doc = YAML.load(File.read(fn))
        next unless doc.respond_to?(:comment)
        next unless doc.comment

        # Print name of object
        puts doc.full_name

        # Print the body: RDoc comments, but only partial...
        puts doc.comment.map{|f| f.body if f.respond_to?(:body)}.join("\n")
      end
    end

Originally from the article Fun with Ferret.
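
One detail of the last line of that snippet is easy to miss: map returns nil for every fragment without a body, and join turns each nil into an empty string, so the output contains blank lines. The sketch below shows the difference using hypothetical stand-in structs instead of real RI objects; selecting first is one possible way to skip those fragments.

```ruby
# Hypothetical stand-ins for comment fragments; real RI objects differ
paragraph = Struct.new(:body)
rule      = Struct.new(:width)   # deliberately has no body method

comment = [
  paragraph.new('First paragraph.'),
  rule.new(1),
  paragraph.new('Second paragraph.')
]

# Same expression as in the snippet: the rule becomes nil, then a blank line
with_blanks = comment.map { |f| f.body if f.respond_to?(:body) }.join("\n")

# Filtering first leaves only fragments that actually have a body
bodies_only = comment.select { |f| f.respond_to?(:body) }.map(&:body).join("\n")

puts with_blanks
puts '---'
puts bodies_only
```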

Tagged rdoc, ri, documentation, ruby

How to parse Ruby source code documentation with RDoc and a custom RDoc generator

Ruby posted about 1 year ago by christian

This is a skeleton for an RDoc generator that extends the existing HtmlGenerator. This means we get the same documentation as seen at, for example, http://api.rubyonrails.org/, with links and HTML-formatted documentation.

It can serve as a starting point for whatever you want to do with RDoc documentation. Currently it prints out the files, modules, classes and methods found in the processed files.

To use it, create a new file named custom_generator.rb in the rdoc/generators subfolder of your Ruby installation. Then put the following code in the file:

    require 'rdoc/generators/html_generator'

    module Generators

      class HTMLGenerator

        # We don't need a template
        def load_html_template
        end

        def generate(toplevels)
          @toplevels  = toplevels
          @files      = []
          @classes    = []

          build_indices

          puts "===================="
          puts "Files"
          puts "===================="

          @files.each do |item|
            puts item.name
            #values = item.value_hash
            #puts item.description
          end

          puts "===================="
          puts "Modules and classes"
          puts "===================="

          @classes.each do |item|
            puts item.name
            #values = item.value_hash
            #puts item.description
          end

          puts "===================="
          puts "Methods"
          puts "===================="

          HtmlMethod.all_methods.each do |item|
            puts item.name
          end
        end
      end

      class HtmlFile
        # Add a description method, after all HtmlMethod has it...
        def description
          value_hash if @values.size == 0
          @values["description"]
        end
      end

      class HtmlClass
        # Add a description method, after all HtmlMethod has it...
        def description
          value_hash if @values.size == 0
          @values["description"]
        end
      end

      class CUSTOMGenerator < HTMLGenerator
      end

    end

Then run the custom generator by using the fmt parameter:

    rdoc --fmt custom lib/base64.rb lib/pp.rb

You can also control RDoc programmatically, with the following code:

    #!/usr/bin/env ruby
    require 'rdoc/rdoc'

    # Remove output from a previous run
    `rm -rf doc`

    begin
      r = RDoc::RDoc.new
      r.document(['--inline-source', '--fmt', 'custom'] + ARGV)
    rescue RDoc::RDocError => e
      $stderr.puts e.message
      exit(1)
    end

Tagged ruby, rdoc, generator, documentation