
Capistrano 2 task for backing up your MySQL production database before each deployment

Tagged ruby, rails, mysql, backup, capistrano  Languages ruby

This Capistrano task connects to your production database and dumps the contents to a file. The file is compressed and put in a directory specified with set :backup_dir, "#{deploy_to}/backups". This is a slight modification of http://pastie.caboo.se/42574. All credit to court3nay.

task :backup, :roles => :db, :only => { :primary => true } do
  filename = "#{backup_dir}/#{application}.dump.#{Time.now.to_f}.sql.bz2"
  text = capture "cat #{deploy_to}/current/config/database.yml"
  yaml = YAML::load(text)

  on_rollback { run "rm #{filename}" }
  run "mysqldump -u #{yaml['production']['username']} -p #{yaml['production']['database']} | bzip2 -c > #{filename}" do |ch, stream, out|
    ch.send_data "#{yaml['production']['password']}\n" if out =~ /^Enter password:/
  end
end
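
The capture-and-parse step above boils down to a few lines of YAML handling. Here is a minimal, self-contained sketch with a made-up database.yml (the credentials are hypothetical):

```ruby
require 'yaml'

# Hypothetical database.yml contents, standing in for the captured file
text = <<YAML
production:
  adapter: mysql
  database: myapp_production
  username: deploy
  password: secret
YAML

yaml = YAML.load(text)

# Build the same mysqldump command the task runs remotely
cmd = "mysqldump -u #{yaml['production']['username']} -p #{yaml['production']['database']}"
puts cmd # => mysqldump -u deploy -p myapp_production
```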

To automatically backup your data before you deploy a new version add this to config/deploy.rb:

task :before_deploy do
  backup
end

To restore a backup, first decompress it, then pipe it to mysql:

bunzip2 filename.sql.bz2
mysql database_name -uroot < filename.sql

How to benchmark your Ruby code

Tagged benchmark, performance, ruby  Languages ruby

You can easily benchmark your Ruby code like this:

require 'benchmark'
seconds = Benchmark.realtime do
    sleep 1
end
print "#{seconds} elapsed..."

The output should be close to 1 second.
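
If you want to compare several implementations rather than time a single block, Benchmark.bm prints a small timing table. A sketch (the two variants are made up for illustration):

```ruby
require 'benchmark'

Benchmark.bm(12) do |x|
  # Append to one string
  x.report("string <<")  { s = ""; 10_000.times { s << "a" } }
  # Collect into an array, then join
  x.report("array join") { a = []; 10_000.times { a << "a" }; a.join }
end
```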

How to parse RI generated documentation using RDoc and Ruby

Tagged rdoc, ri, documentation, ruby  Languages ruby

RI stores the generated documentation as YAML files. This code uses RDoc to parse the YAML files:

require 'yaml'
require 'find'
require "rdoc/ri/ri_driver"

dirs = RI::Paths::PATH
dirs.each do |dir|
  Find.find(dir) do |fn|
    next unless File.file?(fn)
    doc = YAML.load(File.read(fn))
    next unless doc.respond_to?(:comment)
    next unless doc.comment
    
    # Print name of object
    puts doc.full_name
    
    # Print the body: RDoc comments, but only partial...
    puts doc.comment.map{|f| f.body if f.respond_to?(:body)}.join("\n")
  end
end

Originally from the article Fun with Ferret.

Scraping Google search results with Scrubyt and Ruby

Tagged web, scraping, google, scrubyt, ruby, gotcha, hpricot, todelete, obsolete  Languages ruby

Note that these instructions don't work with the latest Scrubyt version...

Scrubyt is a Ruby library that allows you to easily scrape the contents of any site.

First install Scrubyt:

$ sudo gem install mechanize hpricot parsetree ruby2ruby scrubyt

You also need to install RubyInline version 3.6.3:

sudo gem install -v 3.6.3 RubyInline

If you install the wrong RubyInline version or have multiple versions installed, you'll get the following error:

/usr/lib/ruby/1.8/rubygems.rb:207:in activate': can't activate RubyInline (= 3.6.3), already activated RubyInline-3.6.6 (Gem::Exception)
       from /usr/lib/ruby/1.8/rubygems.rb:225:in activate'
       from /usr/lib/ruby/1.8/rubygems.rb:224:in each'
       from /usr/lib/ruby/1.8/rubygems.rb:224:in activate'
       from /usr/lib/ruby/1.8/rubygems/custom_require.rb:32:in require'
       from t2:2

To fix it, first uninstall the newer version and keep only version 3.6.3:

sudo gem uninstall RubyInline

Select RubyGem to uninstall:
 1. RubyInline-3.6.3
 2. RubyInline-3.6.6
 3. All versions
> 2

Scraping Google search results

Then run this to scrape the first two pages of the Google results for ruby:

require 'rubygems'
require 'scrubyt'

# See http://scrubyt.org/example-specification-from-the-page-known-issues-and-pitfalls/

# Create a learning extractor
data = Scrubyt::Extractor.define do
  fetch('http://www.google.com/')
  fill_textfield 'q', 'ruby'
  submit
  
  # Teach Scrubyt what we want to retrieve
  # In this case we want Scruby to find all search results
  # and "Ruby Programming Language" happens to be the first 
  # link in the result list. Change "Ruby Programming Language" 
  # to whatever you want Scruby to find.
  link do
    name  "Ruby Programming Language"
    url   "href", :type => :attribute
  end
  
  # Click next until we're on the second page.
  next_page "Next", :limit => 2
end

# Print out what Scruby found
puts data.to_xml 

puts "Your production scraper has been created: data_extractor_export.rb."

# Export the production version of the scraper
data.export(__FILE__)

Learning Extractor vs Production extractor

Note that this example uses the Learning Extractor functionality of Scrubyt.

The production extractor is generated with the last line:

data.export(__FILE__)

If you open the production extractor in an editor you'll see that it uses XPath queries to extract the content:

link("/html/body/div/div/div/h2", { :generalize => true }) do
    name("/a[1]")
    url("href", { :type => :attribute })
  end

Finding the correct XPath

The learning mode is pretty good at finding the XPath of HTML elements, but if you have difficulties getting Scrubyt to extract exactly what you want, simply install Firebug and use the Inspect feature to select the item you want to extract the value from. Then right-click on it in the Firebug window and choose Copy XPath.

Note that there's a gotcha when copying the XPath of an element with Firebug. Firebug uses Firefox's internal, normalized DOM model, which might not match the real-world HTML structure. For example, the tbody tag is usually added by Firefox/Firebug, and should be removed from the copied XPath if it isn't in the actual HTML.
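
Stripping the extra tbody step out of a copied XPath is a one-liner. A small sketch (the XPath itself is hypothetical):

```ruby
# XPath as copied from Firebug, with the injected tbody element
xpath = "/html/body/table[2]/tbody/tr[2]/td[3]"

cleaned = xpath.gsub("/tbody", "")
puts cleaned # => /html/body/table[2]/tr[2]/td[3]
```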

Another option that I haven't tried myself is to use the XPather extension.

Using hpricot to find the XPath

If you're really having problems finding the right XPath of an element, you can also use Hpricot to find it. In this example the code prints out the XPath of all table cells containing the text 51,992:

require 'rexml/document'
require 'hpricot'
require 'open-uri'

url = "http://xyz"

page = Hpricot(open(url,
  'User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
  'Referer'    => 'http://xyz'
))

page.search( "//td:contains('51,992')" ).each do |row|
  puts row.xpath()
end

The output from the above snippet looks something like this:

/html/body/table[2]/tr[2]/td[3]
/html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]
/html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]/table[1]/tr[2]/td[2]

Note that sometimes I find Hpricot easier to use than Scrubyt, so use whichever works best for you.

Miscellaneous problems

The following problem can be solved by following the instructions found here:

Your production scraper has been created: data_extractor_export.rb.
/var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in extend': wrong argument type Class (expected Module) (TypeError)
       from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in to_sexp'
       from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:93:in parse_tree_for_method'
       from /var/lib/gems/1.8/gems/ruby2ruby-1.1.6/lib/ruby2ruby.rb:1063:in to_sexp'

A simple Jabber/XMPP bot that uses the Jabber::Simple library

Tagged jabber, xmpp, gmail, gtalk, bot  Languages ruby

First install Jabber::Simple:

$ sudo gem install xmpp4r-simple -y

On OSX you might get this error when installing xmpp4r-simple and the rdoc dependency:

make
gcc -I. -I/usr/local/lib/ruby/1.8/i686-darwin8.10.3 -I/usr/local/lib/ruby/1.8/i686-darwin8.10.3 -I.  -fno-common -g -O2  -fno-common -pipe -fno-common  -c callsite.c
gcc -I. -I/usr/local/lib/ruby/1.8/i686-darwin8.10.3 -I/usr/local/lib/ruby/1.8/i686-darwin8.10.3 -I.  -fno-common -g -O2  -fno-common -pipe -fno-common  -c rcovrt.c
cc -dynamic -bundle -undefined suppress -flat_namespace  -L"/usr/local/lib" -o rcovrt.bundle callsite.o rcovrt.o  -lruby  -lpthread -ldl -lobjc  
/usr/bin/ld: /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libpthread.dylib unknown flags (type) of section 6 (__TEXT,__dof_plockstat) in load command 0
/usr/bin/ld: /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libdl.dylib unknown flags (type) of section 6 (__TEXT,__dof_plockstat) in load command 0
/usr/bin/ld: /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libobjc.dylib load command 9 unknown cmd field
/usr/bin/ld: /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib unknown flags (type) of section 6 (__TEXT,__dof_plockstat) in load command 0
/usr/bin/ld: /usr/lib/libSystem.B.dylib unknown flags (type) of section 6 (__TEXT,__dof_plockstat) in load command 0
collect2: ld returned 1 exit status
make: *** [rcovrt.bundle] Error 1

Simply install Xcode 3 to make the error go away. Then run this code to start the bot. Warning: the bot executes the message body (for example "ls -la") as a shell command on the system:

require 'rubygems'
require 'xmpp4r-simple'

include Jabber
#Jabber::debug = true

jid = 'user@server.com'
pass = 'password'

jabber = Simple.new(jid, pass)

loop do
  messages = jabber.received_messages
  messages.each do |message|
    # Only handle chat messages that actually have a body
    next unless message.type == :chat && message.body

    process = IO.popen(message.body)
    result = process.readlines

    jabber.deliver('some.user@gmail.com', result)
  end

  sleep 1
end

To use Google Talk from a domain other than gmail.com, you need to edit the Jabber::Simple source code...
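
Given the warning above about the bot executing message bodies verbatim, you may want to whitelist commands before running them. A minimal sketch (the allowed list is made up):

```ruby
# Commands the bot is allowed to run; everything else is ignored
ALLOWED = ['ls', 'uptime', 'date']

def safe_command?(body)
  return false if body.nil?
  # Check the first word of the message against the whitelist
  ALLOWED.include?(body.strip.split.first)
end

puts safe_command?('ls -la')   # => true
puts safe_command?('rm -rf /') # => false
```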

Parsing feeds with Ruby and the FeedTools gem

Tagged feedtools, rss, atom, parser, ruby, content encoding, utf-8, iso-8859-1  Languages ruby

This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and so on...

The only negative thing about FeedTools is that the project is abandoned. The author said this in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”

Installing

$ sudo gem install feedtools

Fetching and parsing a feed

Easy...

require 'rubygems'
require 'feed_tools'
feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

puts feed.title
puts feed.link
puts feed.description

for item in feed.items
  puts item.title
  puts item.link
  puts item.content
end

Feed autodiscovery

FeedTools finds the Slashdot feed for you.

puts FeedTools::Feed.open('http://www.slashdot.org').href

Helpers

FeedTools can also cleanup your dirty XML/HTML:

require 'feed_tools'
require 'feed_tools/helpers/feed_tools_helper'

FeedTools::HtmlHelper.tidy_html(html)

Database cache

FeedTools can also store the fetched feeds for you:

FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"

The schema contains all you need:

-- Example MySQL schema
  CREATE TABLE cached_feeds (
    id              int(10) unsigned NOT NULL auto_increment,
    href            varchar(255) default NULL,
    title           varchar(255) default NULL,
    link            varchar(255) default NULL,
    feed_data       longtext default NULL,
    feed_data_type  varchar(20) default NULL,
    http_headers    text default NULL,
    last_retrieved  datetime default NULL,
    time_to_live    int(10) unsigned NULL,
    serialized       longtext default NULL,
    PRIMARY KEY  (id)
  )

There's even a Rails migration file included.

Feed updater

There's also a feed updater tool that can fetch feeds in the background, but I haven't had time to look at it yet.

sudo gem install feedupdater

Character set/encoding bug

As always, there are bugs you need to be aware of, and FeedTools is no different. There's an encoding bug: FeedTools encodes everything to ISO-8859-1 instead of UTF-8, which should be the default encoding.

To fix it use the following code:

require 'iconv'

ic = Iconv.new('ISO-8859-1', 'UTF-8')
feed.description = ic.iconv(feed.description)

You can also try this patch.

cd /usr/local/lib/ruby/gems/1.8/gems/
wget http://n0life.org/~julbouln/feedtools_encoding.patch
patch -p1 < feedtools_encoding.patch

The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial

Time estimation

By default, FeedTools tries to estimate when a feed item was published if the date isn't available from the feed. This annoys me and creates weird publish dates, so it's usually a good idea to disable it with the timestamp_estimation_enabled option:

FeedTools.reset_configurations
FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = nil
FeedTools.configurations[:default_ttl]   = 15.minutes
FeedTools.configurations[:timestamp_estimation_enabled] = false

Configuration options

To see a list of available configuration options run the following code:

pp FeedTools.configurations

How to use Vlad the Deployer with git, nginx, mongrel, mongrel_cluster and Rails

Tagged vlad, deployer, deploy, capistrano, nginx, mongrel, mongrel_cluster  Languages ruby

This is a draft...

Installing Vlad the Deployer

gem install vlad

Configuring Vlad the Deployer

Add this to the end of RakeFile:

begin
  require 'rubygems'
  require 'vlad'
  Vlad.load :scm => :git
rescue LoadError => e
  puts "Unable to load Vlad #{e}."
end

Note that we're telling Vlad to use git. This snippet gives you a quick introduction on how to use git with Rails.

Creating the deployment recipe

If you're uncertain what these variables mean, have a look at the docs. This folder is also worth a look, and don't forget to take a peek at the vlad source code.

#
# General configuration
#
set :ssh_flags,             '-p 666'
set :application,           'xxx.com'
set :domain,                '127.0.0.1'
set :deploy_to,             '/var/www/xxx.com'
set :repository,            '/var/lib/git/repositories/xxx.com/.git/'


#
# Mongrel configuration
#
set :mongrel_clean,         true
set :mongrel_command,       'sudo mongrel_rails'
set :mongrel_group,         'www-data'
set :mongrel_port,          9000
set :mongrel_servers,       3

#set :mongrel_address,       '127.0.0.1'
#set(:mongrel_conf)          { '#{shared_path}/mongrel_cluster.conf' }
#set :mongrel_config_script, nil
#set :mongrel_environment,   'production'
#set :mongrel_log_file,      nil
#set :mongrel_pid_file,      nil
#set :mongrel_prefix,        nil
#set :mongrel_user,          'mongrel'

#
# Customize Vlad to our needs
#
namespace :vlad do
  #
  # Add an after_update hook
  #
  remote_task :update do
    Rake::Task['vlad:after_update'].invoke
  end

  #
  # The after_update hook, which is run after vlad:update
  #
  remote_task :after_update do
  # Link to shared resources, if you have them in .gitignore
  #  run "ln -s #{deploy_to}/shared/system/database.yml #{deploy_to}/current/config/database.yml"
  end

  #
  # Deploys a new version of your application
  #
  remote_task :deploy => [:update, :migrate, :start_app]
end

Setup the server

$ rake vlad:setup

This will create the necessary folders and mongrel_cluster configuration file.

Deploy the application

Now deploy the application with vlad:deploy, which is a custom rake task that we added to the deployment recipe:

$ rake vlad:deploy

Copying your SSH public key to the remote server

Vlad uses ssh to execute commands on the remote server, and rsync to copy the build to your server, which means you'll quickly grow tired of typing your password every time a command is run.

This problem is solved by copying your public SSH key to the remote server; this snippet explains how to do exactly that.

Using backgroundrb to execute tasks asynchronously in Rails

Tagged backgroundrb, rails, ruby, distributed, messaging  Languages ruby

Draft...

Planning on using BackgrounDRb? Take a long look at the alternatives first

Ask yourself: do you really need a complex solution like BackgrounDRb? Most likely you don't, so use a simple daemonized process instead; see this snippet about the daemons gem for more information.

Heck, even a simple Ruby script run by cron every 5 minutes will be more stable than BackgrounDRb and require less work.
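
For example, a cron-driven script can be as small as this (the script name, log path, and crontab entry are hypothetical; real work such as updating your models would replace the logging):

```ruby
# update_feeds.rb -- run from cron, e.g.:
#   */5 * * * * ruby /path/to/update_feeds.rb

def update_feeds(log_path)
  # A real script would load the Rails environment and update the models;
  # this sketch only appends a timestamp to a log file.
  File.open(log_path, 'a') { |f| f.puts "#{Time.now}: updated feeds" }
end

update_feeds('/tmp/update_feeds.log')
```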

Even if you really need to process a lot of data asynchronously in the background, I wouldn't recommend BackgrounDRb. It's riddled with bugs and unstable in production, so use the BJ plugin instead.

Anyway, continue reading if you want to use BackgrounDRb...

Installing the prerequisites:

$ sudo gem install chronic packet

Installing backgroundrb

$ cd rails_project
$ git clone git://gitorious.org/backgroundrb/mainline.git vendor/plugins/backgroundrb

You can also get the latest stable version from the Subversion repository:

svn co http://svn.devjavu.com/backgroundrb/trunk  vendor/plugins/backgroundrb

Setup backgroundrb

rake backgroundrb:setup

Create a worker

./script/generate worker feeds_worker

class FeedsWorker < BackgrounDRb::MetaWorker
  set_worker_name :feeds_worker
  
  def create(args = nil)
    # this method is called, when worker is loaded for the first time
    logger.info "Created feeds worker"
  end
  
  def update(data)
    logger.info "Updating #{Feed.count} feeds."
    
    seconds = Benchmark.realtime do
      thread_pool.defer do
        Feed.update_all()
      end
    end

    logger.info "Update took #{'%.5f' % seconds}."
  end
end

Starting backgroundrb

First configure backgroundrb by opening config/backgroundrb.yml in your editor:

:backgroundrb:
  :ip: 0.0.0.0

:development:
  :backgroundrb:
    :port: 11111     # use port 11111
    :log: foreground # foreground mode,print log messages on console

:production:
  :backgroundrb:
    :port: 22222      # use port 22222

Next, start backgroundrb in development mode:

./script/backgroundrb -e development &

Call your worker

From the command line:

$ script/console
Loading development environment (Rails 2.0.2)
>> MiddleMan.worker(:feeds_worker).update()

When things go wrong

Asynchronous programming is complex, so expect bugs...

Rule #1: know who you're calling.

If you give your MiddleMan the wrong name of your worker, he'll just spit this crap at you:

You have a nil object when you didn't expect it!
The error occurred while evaluating nil.send_request
/usr/local/lib/ruby/gems/1.8/gems/packet-0.1.5/lib/packet/packet_master.rb:44:in ask_worker'
/Users/christian/Documents/Projects/xxx/vendor/plugins/backgroundrb/server/lib/master_worker.rb:104:in process_work'
/Users/christian/Documents/Projects/xxx/vendor/plugins/backgroundrb/server/lib/master_worker.rb:35:in receive_data'
/usr/local/lib/ruby/gems/1.8/gems/packet-0.1.5/lib/packet/packet_parser.rb:29:in call'
/usr/local/lib/ruby/gems/1.8/gems/packet-0.1.5/lib/packet/packet_parser.rb:29:in extract'
/Users/christian/Documents/Projects/xxx/vendor/plugins/backgroundrb/server/lib/master_worker.rb:31:in receive_data'

So for example this command would generate the above mentioned error:

MiddleMan.worker(:illegal_worker).update()

It's always nice to see a cryptic error message like this; it really deserves an award.

Check for bugs and bug fixes

git mainline commits

Going to production

Starting the daemon:

./script/backgroundrb -e production start

Configuring your task to run periodically

The following example makes backgroundrb call the FeedsWorker's update method once every 15 minutes:

:production:
  :backgroundrb:
    :port: 22222      # use port 22222
    :lazy_load: true  # do not load models eagerly
    :debug_log: false # disable log workers and other logging
# Cron based scheduling
:schedules:
  :feeds_worker:
    :update:
      :trigger_args: 0 */15 * * * *
      :data: "Hello world"

At the time of writing, the cron scheduler seems to be broken, so I prefer hard-coding the interval in the worker's create method:

def create
  add_periodic_timer(15.minutes) { update }
end

If using Vlad or Capistrano, it's also a good idea to fix script/backgroundrb by changing these lines:

pid_file = "#{RAILS_HOME}/../../shared/pids/backgroundrb_#{CONFIG_FILE[:backgroundrb][:port]}.pid"
SERVER_LOGGER = "#{RAILS_HOME}/../../shared/log/backgroundrb_server_#{CONFIG_FILE[:backgroundrb][:port]}.log"

Resources

Backgroundrb homepage

Backgroundrb best practices

Backgroundrb scheduling

Debugging backgroundrb

Backroundrb's README

topfunky's messaging article

Example of how to fetch a URL with Net::HTTP and Ruby

Tagged net, http, ruby, example, headers  Languages ruby

require 'net/http'
require 'net/https'

url = URI.parse('http://www.google.com/yo?query=yahoo')

http = Net::HTTP.new(url.host, url.port)

http.open_timeout = http.read_timeout = 10  # Set open and read timeout to 10 seconds
http.use_ssl = (url.scheme == "https")
       
headers = {
  'User-Agent'          => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
  'If-Modified-Since'   => '',
  'If-None-Match'       => ''
}

# Note to self, use request_uri not path: http://www.ruby-doc.org/core/classes/URI/HTTP.html#M004934
response, body = http.get(url.request_uri, headers)

puts response.code
puts response.message

response.each {|key, val| puts key + ' = ' + val}
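
The request_uri note in the code above is worth spelling out: path drops the query string, while request_uri keeps it, so a GET built with path would silently lose the parameters. A quick check (no network access needed):

```ruby
require 'uri'

url = URI.parse('http://www.google.com/yo?query=yahoo')
puts url.path        # => /yo
puts url.request_uri # => /yo?query=yahoo
```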