Using the WWW::Mechanize RubyGem to scrape login-protected pages
This is an example of how to access a login-protected site with WWW::Mechanize. In this example, the login form has two fields named user and password. In other words, the HTML contains the following code:
    <input name="user" .../>
    <input name="password" .../>
Note that this example also shows how to enable WWW::Mechanize logging and how to capture the HTML response:
    require 'rubygems'
    require 'logger'
    require 'mechanize'

    # Log all requests and responses to STDERR
    agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
    # agent.set_proxy('a-proxy', '8080')
    page = agent.get 'http://bobthebuilder.com'

    # Fill in the first form on the page, i.e. the login form
    form = page.forms.first
    form.user = 'bob'
    form.password = 'password'

    page = agent.submit form

    # Save the HTML of the page returned after logging in
    output = File.open("output.html", "w") { |file| file << page.body }
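To make it clearer what the submit call sends, here is a minimal sketch of the request body for such a form. It assumes the form is posted as ordinary URL-encoded form data, which the example above does not state. login_form_body is a hypothetical helper of mine, not part of Mechanize.

```ruby
require 'uri'

# Hypothetical helper (not part of Mechanize): builds the URL-encoded body
# that a form with "user" and "password" fields would send on submit,
# assuming a standard application/x-www-form-urlencoded POST.
def login_form_body(user, password)
  URI.encode_www_form('user' => user, 'password' => password)
end

puts login_form_body('bob', 'password')
```

Mechanize takes care of this encoding, plus cookies and redirects, so the sketch is only meant to show what ends up on the wire.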
Use the search method to scrape the page content. In this example I extract all text contained by span elements, which in turn are contained by a table element having a class attribute equal to ‘list-of-links’:
    puts page.search("//table[@class='list-of-links']//span/text()")
The HTML looks like this (td, tr elements omitted for clarity):
    ...
    <table class="list-of-links">
      ...
      <span>The content</span>
      ...
    </table>
    ...
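If you want to try the XPath expression without logging in anywhere, the following sketch runs it against a small sample document using REXML from the standard library. REXML is only a stand-in here (it is not what Mechanize uses to parse pages), and the sample markup, including the second table, is made up for illustration.

```ruby
require 'rexml/document'

# Sample markup modelled on the snippet above. The tr/td elements are
# included so the fragment is well-formed XML, which REXML requires.
html = <<~HTML
  <html>
    <body>
      <table class="list-of-links">
        <tr><td><span>The content</span></td></tr>
        <tr><td><span>More content</span></td></tr>
      </table>
      <table class="other">
        <tr><td><span>Ignored</span></td></tr>
      </table>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Same XPath expression as in the Mechanize example.
xpath = "//table[@class='list-of-links']//span/text()"
texts = REXML::XPath.match(doc, xpath).map(&:to_s)

puts texts
```

Only the spans inside the table with the list-of-links class are matched; the span in the other table is skipped.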
How to add OpenID support to your Rails application with the open_id_authentication plugin
These instructions have been tested with Rails 2.0.2 and ruby-openid 2.0.4. The snippet is an adaptation of the instructions in Ryan Bates’ screencast on how to integrate OpenID with Rails.
Installing and configuring the restful_authentication plugin
Follow these instructions: How to install and use the restful_authentication Rails plugin.
Installing the ruby-openid gem
    gem install ruby-openid
Installing the open_id_authentication Rails plugin
    script/plugin source http://svn.techno-weenie.net/projects/plugins/
    script/plugin install open_id_authentication
Create the migration files
    rake open_id_authentication:db:create
Add the following to the self.up method in 002_add_open_id_authentication_tables.rb:
    add_column :users, :identity_url, :string
Configuring the routes
    map.open_id_complete 'session', :controller => "sessions", :action => "create", :requirements => { :method => :get }
Protect the identity_url field
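Next, add identity_url to the attr_accessible list in user.rb, account.rb or your custom user model. attr_accessible whitelists the attributes that can be set through mass assignment, so anything not listed stays protected:
    attr_accessible :login, :email, :password, :password_confirmation, :identity_url
Also add the following to the self.down method in 002_add_open_id_authentication_tables.rb, so that the identity_url column is removed again when the migration is rolled back:
    remove_column :users, :identity_url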
Integrating OpenID with the login page
Add the following to sessions/new.html.erb:
    <label for="openid_url">OpenID URL</label><br />
    <%= text_field_tag "openid_url" %>
Make sure you’re showing flash messages, otherwise you won’t see the error messages:
    <html>
      <head></head>
      <body>
        <%= [:notice, :error].collect {|type| content_tag('div', flash[type], :id => type) if flash[type] } %>

        <%= yield %>
      </body>
    </html>
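The collect call in the layout yields nil for every flash slot that is empty. Here is a plain-Ruby sketch of the same idea that drops the empty slots explicitly before joining. The content_tag defined here is my own reduced stand-in for the Rails helper (it also escapes its content, which is an assumption of the sketch), and flash_messages is a hypothetical helper name.

```ruby
require 'erb'

# Stand-in for Rails' content_tag helper, reduced to what this layout needs.
def content_tag(name, content, attrs = {})
  attr_text = attrs.map { |key, value| %( #{key}="#{ERB::Util.html_escape(value)}") }.join
  "<#{name}#{attr_text}>#{ERB::Util.html_escape(content)}</#{name}>"
end

# Builds one div per flash message that is present, skipping empty slots.
def flash_messages(flash)
  [:notice, :error]
    .map { |type| content_tag('div', flash[type], id: type) if flash[type] }
    .compact
    .join("\n")
end

puts flash_messages(error: 'Authentication failed.')
```

In a real application this logic would live in a view helper and use the Rails content_tag instead of the stand-in.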
Modifying the sessions controller
Copy and paste the following code into app/controllers/sessions_controller.rb:
    class SessionsController < ApplicationController
      # Hack to fix: No action responded to show
      def show
        create
      end

      def create
        if using_open_id?
          open_id_authentication(params[:openid_url])
        else
          password_authentication(params[:login], params[:password])
        end
      end

      def destroy
        self.current_user.forget_me if logged_in?
        cookies.delete :auth_token
        reset_session
        flash[:notice] = "You have been logged out."
        redirect_back_or_default('/')
      end

      protected

      def open_id_authentication(openid_url)
        authenticate_with_open_id(openid_url, :required => [:nickname, :email]) do |result, identity_url, registration|
          if result.successful?
            @user = User.find_or_initialize_by_identity_url(identity_url)
            if @user.new_record?
              @user.login = registration['nickname']
              @user.email = registration['email']
              @user.save(false)
            end
            self.current_user = @user
            successful_login
          else
            failed_login result.message
          end
        end
      end

      def password_authentication(login, password)
        self.current_user = User.authenticate(login, password)
        if logged_in?
          successful_login
        else
          failed_login
        end
      end

      def failed_login(message = "Authentication failed.")
        flash.now[:error] = message
        render :action => 'new'
      end

      def successful_login
        if params[:remember_me] == "1"
          self.current_user.remember_me
          cookies[:auth_token] = { :value => self.current_user.remember_token, :expires => self.current_user.remember_token_expires_at }
        end
        redirect_back_or_default('/')
        flash[:notice] = "Logged in successfully"
      end
    end
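The create action simply picks one of two authentication paths depending on what was submitted. A rough plain-Ruby sketch of that decision follows. Note that using_open_id? comes from the open_id_authentication plugin; I am assuming here that it boils down to "an openid_url parameter was submitted and is not blank". authentication_method is a hypothetical name and the example URL is made up.

```ruby
# Simplified stand-in for the branching in the create action. The real
# using_open_id? is provided by the plugin; this sketch ASSUMES it means
# "params[:openid_url] is present and not blank".
def authentication_method(params)
  openid_url = params[:openid_url].to_s.strip
  openid_url.empty? ? :password : :open_id
end

puts authentication_method(openid_url: 'http://bob.example.com/')  # OpenID login
puts authentication_method(login: 'bob', password: 'secret')       # password login
```

This is why the same form can carry both the login/password fields and the OpenID URL field: whichever one the user fills in decides the path.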
OpenID authentication from behind a proxy
First, set the HTTP_PROXY environment variable to the proxy URL:
    export HTTP_PROXY=http://proxy.aktagon.com:8080/
Then add the following to environment.rb:
    OpenID::fetcher_use_env_http_proxy
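The value of HTTP_PROXY has to be a complete URL including the port. The following sketch shows which pieces can be read from it, using URI from the standard library. It illustrates the format only and is not how ruby-openid reads the variable internally; proxy_settings is a hypothetical helper.

```ruby
require 'uri'

# Hypothetical helper: parses a proxy URL, such as the value of HTTP_PROXY,
# into host and port. Returns nil when no proxy URL is given.
def proxy_settings(proxy_url)
  return nil if proxy_url.nil? || proxy_url.empty?

  uri = URI.parse(proxy_url)
  { host: uri.host, port: uri.port }
end

settings = proxy_settings('http://proxy.aktagon.com:8080/')
puts "proxy host: #{settings[:host]}, port: #{settings[:port]}"
```

In a shell where the variable is exported you could pass ENV['HTTP_PROXY'] instead of the literal URL.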
How to install and use the restful_authentication Rails plugin
This is an adaptation of the restful_authentication screencast by Ryan Bates. Following the screencast as-is on Rails 2.0.2 (the version shown in the trace below) throws the following error:
    NameError (uninitialized constant SessionsController):
        /usr/local/lib/ruby/gems/1.8/gems/activesupport-2.0.2/lib/active_support/dependencies.rb:266:in `load_missing_constant'
        /usr/local/lib/ruby/gems/1.8/gems/activesupport-2.0.2/lib/active_support/dependencies.rb:453:in `const_missing'
        /usr/local/lib/ruby/gems/1.8/gems/activesupport-2.0.2/lib/active_support/dependencies.rb:465:in `const_missing'
        /usr/local/lib/ruby/gems/1.8/gems/activesupport-2.0.2/lib/active_support/inflector.rb:257:in `constantize'
Installing the restful_authentication plugin
    script/plugin source http://svn.techno-weenie.net/projects/plugins/
    script/plugin install restful_authentication
Generating the model and controller
    script/generate authenticated user sessions
Now run the migration:
    rake db:migrate
Configure routing
Open config/routes.rb and add the following routes:
    map.resources :users
    map.resource  :session

    map.signup '/signup', :controller => 'users', :action => 'new'
    map.login  '/login',  :controller => 'sessions', :action => 'new'
    map.logout '/logout', :controller => 'sessions', :action => 'destroy'
Include restful_authentication in ApplicationController
First remove these lines from the users and sessions controllers:
    # Be sure to include AuthenticationSystem in Application Controller instead
    include AuthenticatedSystem
Now include restful_authentication in the application controller:
    class ApplicationController < ActionController::Base
      include AuthenticatedSystem
      # ... rest of the controller ...
    end
Integrate restful_authentication with your views
First let’s create a controller and view by executing the generate script:
    script/generate controller home index
Modify index.html.erb as follows:
    <h1>Welcome</h1>

    <% if logged_in? %>
      <p><strong>You are logged in as <%=h current_user.login %></strong></p>
      <p><%= link_to 'Logout', logout_path %></p>
    <% else %>
      <p><strong>You are currently not logged in.</strong></p>
      <p>
        <%= link_to 'Login', login_path %> or
        <%= link_to 'Sign Up', signup_path %>
      </p>
    <% end %>
Start Rails and access your application. If needed, add the following to config/routes.rb to make the home controller the default:
    map.root :controller => "home"
Login, sign up and logout should work.
How to use god to monitor a pack of mongrels
God is a monitoring framework written in Ruby that can be used for monitoring, for example, mongrel processes.
Installing god
Install god with the following command:
    sudo gem install god
Configuring god
To configure god, first create a master configuration script by saving the following in /etc/god/god.rb:
    # load in all god configs
    God.load "/etc/god/conf/*.rb"
Now, save this configuration in /etc/god/conf/site.com.rb:
    #
    # Test this configuration file by executing:
    # god -c /path_to_this_file -D
    #
    require 'yaml'

    #
    # Change these to match your project setup
    #
    APPLICATION  = "xxx.com"
    ROOT         = "/var/www/#{APPLICATION}"               # deployment directory
    RAILS_ROOT   = ROOT + '/current'                       # current release directory
    MONGREL_CONF = ROOT + '/shared/mongrel_cluster.conf'   # mongrel_cluster.conf file

    # Read in mongrel_conf
    OPTIONS = YAML.load_file(MONGREL_CONF)                 # Read mongrel configuration

    #
    # TODO This can be simplified
    #
    def ports(port, servers)
      ports = []

      start_port = port
      end_port = start_port + servers - 1

      for port in start_port..end_port do
        ports << port
      end

      ports
    end

    PORTS = ports(OPTIONS['port'].to_i, OPTIONS['servers'].to_i)

    #
    # Returns path of mongrel pid or log file:
    #
    # mongrel_path "/tmp/mongrel.pid", 9000 => "/tmp/mongrel.9000.pid"
    #
    def mongrel_path(file_path, port)
      file_ext = File.extname(file_path)
      file_base = File.basename(file_path, file_ext)
      file_dir = File.dirname(file_path)
      file = [file_base, port].join(".") + file_ext

      File.join(file_dir, file)
    end

    #
    # Returns the mongrel_rails start, stop or restart command depending on command parameter
    #
    def mongrel_rails(command, port)
      raise "Unsupported command '#{command}'" if !['start', 'stop', 'restart'].include?(command)

      argv = [ "mongrel_rails" ]
      argv << command
      argv << "-d" if command != 'stop'
      argv << "-e #{OPTIONS['environment']}" if OPTIONS['environment'] && command != 'stop'
      argv << "-a #{OPTIONS['address']}" if OPTIONS['address'] && command != 'stop'
      argv << "-c #{OPTIONS['cwd']}" if OPTIONS['cwd']
      argv << "-f #{OPTIONS['force']}" if OPTIONS['force'] && command == 'stop'
      argv << "-o #{OPTIONS['timeout']}" if OPTIONS['timeout'] && command != 'stop'
      argv << "-t #{OPTIONS['throttle']}" if OPTIONS['throttle'] && command != 'stop'
      argv << "-m #{OPTIONS['mime_map']}" if OPTIONS['mime_map'] && command != 'stop'
      argv << "-r #{OPTIONS['docroot']}" if OPTIONS['docroot'] && command != 'stop'
      argv << "-n #{OPTIONS['num_procs']}" if OPTIONS['num_procs'] && command != 'stop'
      argv << "-B" if OPTIONS['debug'] && command != 'stop'
      argv << "-S #{OPTIONS['config_script']}" if OPTIONS['config_script'] && command != 'stop'
      argv << "--user #{OPTIONS['user']}" if OPTIONS['user'] && command != 'stop'
      argv << "--group #{OPTIONS['group']}" if OPTIONS['group'] && command != 'stop'
      argv << "--prefix #{OPTIONS['prefix']}" if OPTIONS['prefix'] && command != 'stop'
      argv << "-p #{port}" if command != 'stop'
      argv << '-P ' + mongrel_path(OPTIONS['pid_file'], port)
      argv << '-l ' + mongrel_path(OPTIONS['log_file'], port) if command != 'stop'

      cmd = argv.join " "

      return cmd
    end

    PORTS.each do |port|
      God.watch do |w|
        w.name = "#{APPLICATION}-#{port}"
        w.group = "mongrels"
        w.interval = 30.seconds
        w.start = mongrel_rails('start', port)
        w.stop = mongrel_rails('stop', port)
        w.restart = mongrel_rails('restart', port)
        w.start_grace = 10.seconds
        w.restart_grace = 10.seconds
        w.pid_file = File.join(RAILS_ROOT, "/tmp/pids/mongrel.#{port}.pid")

        w.behavior(:clean_pid_file)

        w.start_if do |start|
          start.condition(:process_running) do |c|
            c.interval = 5.seconds
            c.running = false
          end
        end

        w.restart_if do |restart|
          restart.condition(:memory_usage) do |c|
            c.above = 150.megabytes
            c.times = [3, 5] # 3 out of 5 intervals
          end

          restart.condition(:cpu_usage) do |c|
            c.above = 50.percent
            c.times = 5
          end
        end

        # lifecycle
        w.lifecycle do |on|
          on.condition(:flapping) do |c|
            c.to_state = [:start, :restart]
            c.times = 5
            c.within = 5.minute
            c.transition = :unmonitored
            c.retry_in = 10.minutes
            c.retry_times = 5
            c.retry_within = 2.hours
          end
        end
      end
    end
Add a script for each site you want to monitor.
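To see what the helper functions in the god script produce without starting god, here is a standalone sketch. The configuration values are made up for illustration (only the key names are taken from the script), and the ports helper is written with a range, which is one possible way to address the TODO in the script.

```ruby
require 'yaml'

# Hypothetical mongrel_cluster.conf contents. The keys match those read by
# the god script; the values are invented for this example.
SAMPLE_CONF = <<~CONF
  port: "9000"
  servers: 3
  environment: production
  pid_file: tmp/pids/mongrel.pid
  log_file: log/mongrel.log
CONF

options = YAML.load(SAMPLE_CONF)

# Simplified version of the ports helper: a range replaces the explicit loop.
def ports(port, servers)
  (port...(port + servers)).to_a
end

# Same behaviour as mongrel_path in the god script: inserts the port
# before the file extension.
def mongrel_path(file_path, port)
  ext  = File.extname(file_path)
  base = File.basename(file_path, ext)
  File.join(File.dirname(file_path), "#{base}.#{port}#{ext}")
end

# Print the PID file each mongrel in the cluster would use.
ports(options['port'].to_i, options['servers'].to_i).each do |port|
  puts "#{port}: #{mongrel_path(options['pid_file'], port)}"
end
```

Each port gets its own watch in god, so a cluster of three servers starting at port 9000 results in three watched processes, each with its own PID and log file.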
Starting god
To start god execute:
    god -c /etc/god/god.rb
For a list of available commands run god with the help switch:
    $ god --help
      Usage:
        Starting:
          god [-c <config file>] [-p <port> | -b] [-P <file>] [-l <file>] [-D]

        Querying:
          god <command> <argument> [-p <port>]
          god <command> [-p <port>]
          god -v
          god -V (must be run as root to be accurate on Linux)

        Commands:
          start <task or group name>         start task or group
          restart <task or group name>       restart task or group
          stop <task or group name>          stop task or group
          monitor <task or group name>       monitor task or group
          unmonitor <task or group name>     unmonitor task or group
          remove <task or group name>        remove task or group from god
          load <file>                        load a config into a running god
          log <task name>                    show realtime log for given task
          status                             show status of each task
          quit                               stop god
          terminate                          stop god and all tasks
          check                              run self diagnostic

        Options:
          -c, --config-file CONFIG     Configuration file
          -p, --port PORT              Communications port (default 17165)
          -b, --auto-bind              Auto-bind to an unused port number
          -P, --pid FILE               Where to write the PID file
          -l, --log FILE               Where to write the log file
          -D, --no-daemonize           Don't daemonize
          -v, --version                Print the version number and exit
          -V                           Print extended version and build information
              --log-level LEVEL        Log level [debug|info|warn|error|fatal]
              --no-syslog              Disable output to syslog
              --attach PID             Quit god when the attached process dies
              --no-events              Disable the event system
              --bleakhouse             Enable bleakhouse profiling
Surviving reboots
Save the following in /etc/init.d/god:
    #!/bin/bash
    #
    # God
    #

    RETVAL=0

    case "$1" in
        start)
          god -c /etc/god/god.rb -P /var/run/god.pid -l /var/log/god.log
          RETVAL=$?
          echo "God started"
      ;;
        stop)
          kill `cat /var/run/god.pid`
          RETVAL=$?
          echo "God stopped"
      ;;
        restart)
          kill `cat /var/run/god.pid`
          god -c /etc/god/god.rb -P /var/run/god.pid -l /var/log/god.log
          RETVAL=$?
          echo "God restarted"
      ;;
        status)
          # Show the status of each task
          god status
          RETVAL=$?
      ;;
        *)
          echo "Usage: god {start|stop|restart|status}"
          exit 1
      ;;
    esac

    exit $RETVAL
Make the file executable with chmod:
    chmod +x /etc/init.d/god
Tell Debian to run the script at startup:
    sudo /usr/sbin/update-rc.d -f god defaults
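The init script keeps track of god through the PID file /var/run/god.pid. If you want to check from Ruby whether the process recorded in that file is still alive, a sketch could look like the one below. pid_file_running? is a hypothetical helper of mine and not something god provides.

```ruby
# Hypothetical helper: reports whether the process recorded in a PID file
# is alive. Signal 0 only checks for existence and does not affect the process.
def pid_file_running?(pid_file)
  pid = Integer(File.read(pid_file).strip)
  Process.kill(0, pid)
  true
rescue Errno::ENOENT, Errno::EACCES, Errno::ESRCH, ArgumentError
  false # missing or unreadable PID file, no such process, or unparsable PID
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

puts pid_file_running?('/var/run/god.pid') ? 'God is running' : 'God is not running'
```

A stale PID file, left behind after a crash, is reported as not running because the kill call raises Errno::ESRCH.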
How to install the stemmer4r gem on Mac OS X and Linux
The stemmer4r gem is fubar. Warning: this is a draft snippet.
    # gem install stemmer4r
    Bulk updating Gem source index for: http://gems.rubyforge.org
    Building native extensions.  This could take a while...
    ERROR:  While executing gem ... (Gem::Installer::ExtensionBuildError)
        ERROR: Failed to build gem native extension.

    ruby extconf.rb install stemmer4r

    Gem files will remain installed in /usr/lib/ruby/gems/1.8/gems/stemmer4r-0.6 for inspection.
    Results logged to /usr/lib/ruby/gems/1.8/gems/stemmer4r-0.6/ext/stemmer4r/gem_make.out


    1. Change path of Ruby executable

    cd /usr/lib/ruby/gems/1.8/gems/stemmer4r-0.6/ext/stemmer4r/
    vim extconf.rb

    #!/usr/bin/ruby -w

    to

    #ruby -w

    2. Compile libstemmer_c

    cd /usr/lib/ruby/gems/1.8/gems/stemmer4r-0.6/ext/stemmer4r/libstemmer_c/
    make

    3. Compile stemmer4r

    cd /usr/lib/ruby/gems/1.8/gems/stemmer4r-0.6/ext/stemmer4r/

    Change path:
    /usr/local/ruby/lib/ruby/1.8/i686-linux/
    To:
    /usr/lib/ruby/1.8/x86_64-linux/

    Or wherever you have it installed

    ruby extconf.rb


    4. Build stemmer4r gem


    gem build stemmer4r.gemspec

    gem install stemmer4r-0.6.gem


    Problems

    gcc -shared -rdynamic -Wl,-export-dynamic   -L"/usr/lib" -o stemmer4r.so stemmer4r.o libstemmer_c/libstemmer.o  -lruby1.8  -lpthread -ldl -lcrypt -lm   -lc
    /usr/bin/ld: libstemmer_c/libstemmer.o(libstemmer.o): relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
    libstemmer_c/libstemmer.o: could not read symbols: Bad value
    collect2: ld returned 1 exit status
    make: *** [stemmer4r.so] Error 1


    Add CFLAGS:

    root@aktagon:/usr/lib/ruby/gems/1.8/gems/stemmer4r-0.6/ext/stemmer4r/libstemmer_c# make
    include mkinc.mak
    CFLAGS = -fPIC
    libstemmer.o: $(snowball_sources:.c=.o)
            $(AR) -cru $@ $^