Generate a 56-bit DES encrypted (htpasswd) password with Ruby
Run the following in an irb console to generate a 56-bit DES encrypted password:
    "password".crypt("salt")
The password can be used in an Apache or Nginx htpasswd file to enable basic authentication.
The generated password can also be used in other Unix password files.
Password protecting a folder/resource with Nginx
First add the following to your Nginx configuration file:
    location / {
      auth_basic "Restricted";
      auth_basic_user_file /etc/nginx/htpasswd;
    }
Then create the htpasswd file:
    # this be passwords
    thisbetheusername:thisbeencryptedpass:yercomment
To generate an htpasswd password without installing Apache, you can use the following Perl or Ruby code:
Perl
    perl -le 'print crypt("password", "salt")'
Ruby (run in irb)
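    "password".crypt("salt")
With a two-character salt, crypt() produces a traditional DES-based hash (56-bit key), which is the format used in /etc/passwd and htpasswd files. Only the first two characters of the salt are used, so "salt" is equivalent to "sa", and only the first eight characters of the password are significant.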
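As a minimal sketch, the hash and the htpasswd line can be produced in one step. The helper name `htpasswd_line` and the `SALT_CHARS` constant are made up for this example, and it assumes the platform's crypt(3) still supports the traditional DES scheme:

```ruby
# Characters allowed in a traditional crypt(3) salt.
SALT_CHARS = [*'a'..'z', *'A'..'Z', *'0'..'9', '.', '/'].freeze

# Build one htpasswd line ("user:hash") with a random two-character salt.
# Hypothetical helper; assumes crypt(3) supports the DES scheme here.
def htpasswd_line(username, password)
  salt = Array.new(2) { SALT_CHARS.sample }.join
  "#{username}:#{password.crypt(salt)}"
end

puts htpasswd_line('thisbetheusername', 'password')
```

Append the printed line to the htpasswd file. A random salt means two users with the same password do not end up with the same hash.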
Valid RSS 2.0 Feed Template for Rails
Here’s the template, modify it to fit your needs. I know there are plugins and other ways of doing this, but I hate code that gets too abstract:
    <?xml version="1.0"?>
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
      <channel>
        <atom:link href="http://xxxxxxx" rel="self" type="application/rss+xml" />
        <title>Code Snippets - Aktagon</title>
        <link>http://snippets.aktagon.com/</link>
        <description>Share your code with the world. Allow others to review and comment.</description>
        <language>en-us</language>
        <pubDate><%= @snippets[0].created_at.rfc822 %></pubDate>
        <lastBuildDate><%= @snippets[0].created_at.rfc822 %></lastBuildDate>
        <docs>http://blogs.law.harvard.edu/tech/rss</docs>
        <generator>Aktagon Snippets</generator>
        <% for snippet in @snippets %>
          <item>
            <title><![CDATA[<%= snippet.title %>]]></title>
            <link><%= snippet_url(snippet) %></link>
            <description><![CDATA[<%= snippet.rendered_body %>]]></description>
            <%# Use the item's own timestamp, not the first snippet's %>
            <pubDate><%= snippet.created_at.rfc822 %></pubDate>
            <guid><%= snippet_url(snippet) %></guid>
            <% for tag in snippet.tags %>
              <category domain="http://snippets.aktagon.com/snippets"><![CDATA[<%= tag.name %>]]></category>
            <% end %>
          </item>
        <% end %>
      </channel>
    </rss>
Remember to serve the feed with the correct HTTP Content-Type header, for example application/rss+xml.
It also helps to have an auto-discovery tag inside the head tag:
    <link rel="alternate" type="application/rss+xml" title="RSS feed" href="http://<%= request.host %>/rss/" />
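To try the item markup outside Rails, the following sketch renders a single item with the standard library's ERB. The `Snippet` struct and the `render_item` helper are hypothetical stand-ins for the Rails model and view, and the item is trimmed down to the title and date. Each item is rendered with its own timestamp:

```ruby
require 'erb'
require 'time' # adds Time#rfc822

# Hypothetical stand-in for the Rails model used in the template.
Snippet = Struct.new(:title, :created_at)

ITEM_TEMPLATE = ERB.new(<<~XML)
  <item>
    <title><![CDATA[<%= snippet.title %>]]></title>
    <pubDate><%= snippet.created_at.rfc822 %></pubDate>
  </item>
XML

# Render one <item> element for the given snippet.
def render_item(snippet)
  ITEM_TEMPLATE.result_with_hash(snippet: snippet)
end

puts render_item(Snippet.new('Hello', Time.utc(2008, 2, 1, 12, 0, 0)))
```

The RFC 822 date format is what RSS 2.0 expects in pubDate and lastBuildDate, so this is a quick way to check the output before pasting it into a feed validator.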
List all links in a file with Ruby and regular expressions
This snippet lists all text links:
    data = File.read('the_link_collection.txt')

    links = data.scan(/href="([^"]*)[^>]*>([^<]*)<\/a>/im)

    links.each do |link|
      puts "#{link[1].chomp} = #{link[0]}"
    end
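To see what scan returns without needing a file on disk, here is a small sketch that runs a similar pattern against an inline string. The `extract_links` helper, the `LINK_PATTERN` constant and the sample markup are invented for this example, and the pattern is slightly stricter than the one above because it matches the closing quote explicitly:

```ruby
# Captures the href value and the link text of each anchor.
LINK_PATTERN = /href="([^"]*)"[^>]*>([^<]*)<\/a>/im

# Return [text, url] pairs for every text link found in the HTML.
def extract_links(html)
  html.scan(LINK_PATTERN).map { |url, text| [text.strip, url] }
end

sample = '<a href="http://example.com/">Example</a> ' \
         '<a class="x" href="http://example.org/a">Org</a>'

extract_links(sample).each { |text, url| puts "#{text} = #{url}" }
```

Because the pattern has two capture groups, scan returns one two-element array per match. Links that wrap an image or other markup are skipped, since the text group does not allow nested tags.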
Scraping Google search results with Scrubyt and Ruby
Note that these instructions don’t work with the latest Scrubyt version…
Scrubyt is a Ruby library that allows you to easily scrape the contents of any site.
First install Scrubyt:
    $ sudo gem install mechanize hpricot parsetree ruby2ruby scrubyt
You also need to install RubyInline version 3.6.3:
    $ sudo gem install -v 3.6.3 RubyInline
If you install the wrong RubyInline version or have multiple versions installed, you’ll get the following error:
    /usr/lib/ruby/1.8/rubygems.rb:207:in `activate': can't activate RubyInline (= 3.6.3), already activated RubyInline-3.6.6] (Gem::Exception)
        from /usr/lib/ruby/1.8/rubygems.rb:225:in `activate'
        from /usr/lib/ruby/1.8/rubygems.rb:224:in `each'
        from /usr/lib/ruby/1.8/rubygems.rb:224:in `activate'
        from /usr/lib/ruby/1.8/rubygems/custom_require.rb:32:in `require'
        from t2:2
To fix it, uninstall the newer version and keep only version 3.6.3:
    sudo gem uninstall RubyInline

    Select RubyGem to uninstall:
     1. RubyInline-3.6.3
     2. RubyInline-3.6.6
     3. All versions
    > 2
Scraping Google search results
Then run this to scrape the first two pages of the Google results for ruby:
    require 'rubygems'
    require 'scrubyt'

    # See http://scrubyt.org/example-specification-from-the-page-known-issues-and-pitfalls/

    # Create a learning extractor
    data = Scrubyt::Extractor.define do
      fetch('http://www.google.com/')
      fill_textfield 'q', 'ruby'
      submit

      # Teach Scrubyt what we want to retrieve.
      # In this case we want Scrubyt to find all search results,
      # and "Ruby Programming Language" happens to be the first
      # link in the result list. Change "Ruby Programming Language"
      # to whatever you want Scrubyt to find.
      link do
        name "Ruby Programming Language"
        url "href", :type => :attribute
      end

      # Click next until we're on the second page.
      next_page "Next", :limit => 2
    end

    # Print out what Scrubyt found
    puts data.to_xml

    puts "Your production scraper has been created: data_extractor_export.rb."

    # Export the production version of the scraper
    data.export(__FILE__)
Learning Extractor vs Production extractor
Note that this example uses the Learning Extractor functionality of Scrubyt.
The production extractor is generated with the last line:
    data.export(__FILE__)
If you open the production extractor in an editor you’ll see that it uses XPath queries to extract the content:
    link("/html/body/div/div/div/h2", { :generalize => true }) do
      name("/a[1]")
      url("href", { :type => :attribute })
    end
Finding the correct XPath
The learning mode is pretty good at finding the XPath of HTML elements, but if you have difficulties getting Scrubyt to extract exactly what you want, simply install Firebug and use the Inspect feature to select the item you want to extract the value from. Then right-click on it in the Firebug window and choose Copy XPath.
Note that there’s a gotcha when copying the XPath of an element with Firebug. Firebug uses Firefox’s internal, normalized DOM model, which might not match the real-world HTML structure. For example, the tbody tag is usually added by Firefox/Firebug, and should be removed from the XPath if it isn’t in the HTML.
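If you copy many XPaths this way, removing the tbody steps by hand gets tedious. The following is a minimal sketch of a helper that strips them. The name `strip_tbody` is made up for this example, and it should only be applied when the page source really has no tbody tags:

```ruby
# Remove tbody steps (with an optional index such as tbody[1]) from an XPath.
# Hypothetical helper; only use it when the raw HTML has no <tbody> tags.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

puts strip_tbody('/html/body/table[2]/tbody/tr[2]/td[3]')
```

The lookahead keeps the helper from touching element names that merely start with tbody.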
Another option that I haven’t tried myself is to use the XPather extension.
Using Hpricot to find the XPath
If you’re really having problems finding the right XPath of an element, you can also use Hpricot to find it. In this example the code prints out the XPath of every table cell containing the text 51,992:
    require 'rexml/document'
    require 'hpricot'
    require 'open-uri'

    url = "http://xyz"

    # Fetch the page with browser-like headers
    page = Hpricot(open(url,
      'User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
      'Referer' => 'http://xyz'
    ))

    # Print the XPath of each matching cell
    page.search( "//td:contains('51,992')" ).each do |row|
      puts row.xpath()
    end
The output from the above snippet looks something like this:
    /html/body/table[2]/tr[2]/td[3]
    /html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]
    /html/body/table[2]/tr[2]/td[3]/table[4]/tr[1]/td[1]/table[1]/tr[2]/td[2]
Note that I sometimes find Hpricot easier to use than Scrubyt, so use whatever works best for you.
Miscellaneous problems
The following problem can be solved by following the instructions found here:
    Your production scraper has been created: data_extractor_export.rb.
    /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `extend': wrong argument type Class (expected Module) (TypeError)
        from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:129:in `to_sexp'
        from /var/lib/gems/1.8/gems/ParseTreeReloaded-0.0.1/lib/parse_tree_reloaded.rb:93:in `parse_tree_for_method'
        from /var/lib/gems/1.8/gems/ruby2ruby-1.1.6/lib/ruby2ruby.rb:1063:in `to_sexp'