utf-8 snippets

How to write UTF-8 data to an Oracle BLOB column with Java and JDBC

Tagged oracle, blob, jdbc, java, utf-8  Languages java

This example works with Oracle:

// Requires java.io.UnsupportedEncodingException, java.sql.Blob, java.sql.Connection,
// java.sql.SQLException and oracle.sql.BLOB (from the Oracle JDBC driver).
private Blob getBlob(Connection connection, String data)
{
  BLOB blob = null;

  try
  {
      // createTemporary can throw SQLException, so it belongs inside the try block
      blob = BLOB.createTemporary(connection, true, BLOB.DURATION_SESSION);
      blob.open(BLOB.MODE_READWRITE);
      // Consider streaming if the data size is unknown. Note that setBytes doesn't work.
      blob.putBytes(1, data.getBytes("UTF-8"));
      return blob;
  }
  catch(UnsupportedEncodingException ex)
  {
      throw new RuntimeException("Unable to get a blob for '" + data + "'", ex);
  }
  catch(SQLException ex)
  {
      throw new RuntimeException("Unable to get a blob for '" + data + "'", ex);
  }
  finally
  {
      try { if(blob != null) blob.close(); } catch(Exception ex) {}
  }
}

Then use the method like this:

Connection connection = getConnection();
PreparedStatement statement = connection.prepareStatement(sql); // sql: your INSERT or UPDATE statement

statement.setBlob(1, getBlob(connection, text)); // text: the String to store

How to read UTF-8 data from an Oracle BLOB column with Java and JDBC

Tagged oracle, blob, utf-8, java  Languages java

This example works with Oracle:

private String getBlobAsString(Blob blob)
{
    StringBuffer result = new StringBuffer();
    
    if ( blob != null ) 
    {
        int read = 0;
        Reader reader = null;
        char[] buffer = new char[1024];
                                
        try
        {
            reader = new InputStreamReader(blob.getBinaryStream(), "UTF-8");

            while((read = reader.read(buffer)) != -1) 
            {
                result.append(buffer, 0, read);
            }
        }
        catch(SQLException ex)
        {
            throw new RuntimeException("Unable to read blob data.", ex);
        }
        catch(IOException ex)
        {
            throw new RuntimeException("Unable to read blob data.", ex);
        }
        finally
        {
            try { if(reader != null) reader.close(); } catch(Exception ex) {};
        }
    }
    
    return result.toString();
}

Then use the method like this:

ResultSet resultSet = statement.executeQuery(); // statement: your prepared SELECT

if (resultSet.next())
{
    String utf8 = getBlobAsString(resultSet.getBlob("xml"));
}

Parsing feeds with Ruby and the FeedTools gem

Tagged feedtools, rss, atom, parser, ruby, content encoding, utf-8, iso-8859-1  Languages ruby

This is an example of how to use the FeedTools gem to parse a feed. FeedTools supports Atom, RSS, and other feed formats.

The one drawback of FeedTools is that the project is abandoned. The author said so in a comment from March 2008: “I’ve effectively abandoned it, so I’m really not going to go taking on huge code reorganization efforts.”

Installing

$ sudo gem install feedtools

Fetching and parsing a feed

Easy...

require 'rubygems'
require 'feed_tools'
feed = FeedTools::Feed.open('http://www.slashdot.org/index.rss')

puts feed.title
puts feed.link
puts feed.description

for item in feed.items
  puts item.title
  puts item.link
  puts item.content
end

Feed autodiscovery

FeedTools finds the Slashdot feed for you.

puts FeedTools::Feed.open('http://www.slashdot.org').href

Helpers

FeedTools can also clean up your dirty XML/HTML:

require 'feed_tools'
require 'feed_tools/helpers/feed_tools_helper'

FeedTools::HtmlHelper.tidy_html(html)

Database cache

FeedTools can also store the fetched feeds for you:

FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"

The schema contains all you need:

-- Example MySQL schema
  CREATE TABLE cached_feeds (
    id              int(10) unsigned NOT NULL auto_increment,
    href            varchar(255) default NULL,
    title           varchar(255) default NULL,
    link            varchar(255) default NULL,
    feed_data       longtext default NULL,
    feed_data_type  varchar(20) default NULL,
    http_headers    text default NULL,
    last_retrieved  datetime default NULL,
    time_to_live    int(10) unsigned NULL,
    serialized       longtext default NULL,
    PRIMARY KEY  (id)
  )

There's even a Rails migration file included.

Feed updater

There's also a feed updater tool that can fetch feeds in the background, but I haven't had time to look at it yet.

sudo gem install feedupdater

Character set/encoding bug

As with any library, there are bugs you need to be aware of. FeedTools has an encoding bug: it encodes everything as ISO-8859-1 instead of UTF-8, which should be the default encoding.

To fix it use the following code:

require 'iconv'

# Iconv.new takes the target encoding first, then the source encoding
ic = Iconv.new('ISO-8859-1', 'UTF-8')
feed.description = ic.iconv(feed.description)

You can also try this patch.

cd /usr/local/lib/ruby/gems/1.8/gems/
wget http://n0life.org/~julbouln/feedtools_encoding.patch
patch -p1 < feedtools_encoding.patch

The character encoding bug is discussed on this page: http://sporkmonger.com/2005/08/11/tutorial
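
Iconv was removed from Ruby's standard library in Ruby 2.0, so on newer Rubies the conversion above has to be written with String#encode. The following is a minimal sketch, not FeedTools code: it assumes the symptom is double-encoded text (UTF-8 bytes that were read as ISO-8859-1 and then encoded to UTF-8 a second time), and undo_double_encoding is a hypothetical helper name.

```ruby
# Hypothetical helper, assuming the input is double-encoded UTF-8.
# Roughly what Iconv.new('ISO-8859-1', 'UTF-8').iconv(text) did on Ruby 1.8.
def undo_double_encoding(text)
  # Step 1: map each character back to a single ISO-8859-1 byte.
  bytes = text.encode('ISO-8859-1', 'UTF-8')
  # Step 2: those bytes are the original UTF-8 data, so relabel them.
  repaired = bytes.force_encoding('UTF-8')
  # Keep the input if the result is not valid UTF-8 (it was not double-encoded).
  repaired.valid_encoding? ? repaired : text
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  # Characters outside ISO-8859-1 mean the text was not double-encoded either.
  text
end

garbled = "caf\u00C3\u00A9"          # what a double-encoded "café" looks like
puts undo_double_encoding(garbled)    # prints the repaired string
```

Text that is already correct passes through unchanged, so the helper should be safe to apply to every field of a feed item, although that is worth checking against your own feeds.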

Time estimation

By default, FeedTools tries to estimate when a feed item was published if the feed doesn't provide a date. This annoys me and produces weird publish dates, so it's usually a good idea to disable it with the timestamp_estimation_enabled option:

FeedTools.reset_configurations
FeedTools.configurations[:tidy_enabled] = false
FeedTools.configurations[:feed_cache] = nil
FeedTools.configurations[:default_ttl]   = 15.minutes
FeedTools.configurations[:timestamp_estimation_enabled] = false

Configuration options

To see a list of available configuration options run the following code:

require 'pp'

pp FeedTools.configurations

How to use Ruby and SimpleRSS to parse RSS and Atom feeds

Tagged rss, atom, parse, ruby, simplerss, encoding, utf-8  Languages ruby

This script is an example of how to use the SimpleRSS gem to parse an RSS feed.

The script can easily be modified to support conditional gets. It also detects the feed's character encoding and converts the feed to UTF-8.

require 'iconv'
require 'net/http'
require 'net/https'
require 'rubygems'
require 'simple-rss'

url = URI.parse('http://hbl.fi/rss.xml')

http = Net::HTTP.new(url.host, url.port)

http.open_timeout = http.read_timeout = 10  # Set open and read timeout to 10 seconds
http.use_ssl = (url.scheme == "https")

headers = {
  'User-Agent'          => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
  # For conditional GETs, store the Last-Modified and ETag response headers
  # in a database and send them back on each request:
  # 'If-Modified-Since' => last_modified,
  # 'If-None-Match'     => etag
}

response = http.get(url.path, headers)
body = response.body

encoding = body.scan(
/^<\?xml [^>]*encoding="([^\"]*)"[^>]*\?>/
).flatten.first

if encoding.nil? || encoding.empty?
    if response["Content-Type"] =~ /charset=([\w\d-]+)/
        encoding = $1.downcase
        puts "Feed #{url} is #{encoding} according to Content-Type header"
    else
        puts "Unable to detect content encoding for #{url}, using default."
        encoding = "ISO-8859-1"
    end
else
    puts "Feed #{url} is #{encoding} according to XML"
end

# Use 'UTF-8//IGNORE', if this throws an exception
ic = Iconv.new('UTF-8', encoding)
body = ic.iconv(body)

feed = SimpleRSS.parse(body)

for item in feed.items
  puts item.title
end
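
The two ideas in the script, conditional GET headers and encoding detection, can be pulled into small functions that are testable without a network connection. This is a sketch under assumptions rather than part of SimpleRSS: the names conditional_headers, detect_feed_encoding and to_utf8 are hypothetical, the etag and last_modified values are assumed to come from your own storage, and String#encode (Ruby 1.9 and later) stands in for Iconv.

```ruby
# Hypothetical helper: build request headers for a conditional GET from
# values saved from the previous response (either may be nil).
def conditional_headers(etag, last_modified)
  headers = {}
  headers['If-None-Match']     = etag          if etag
  headers['If-Modified-Since'] = last_modified if last_modified
  headers
end

# Hypothetical helper: same detection order as the script above.
# XML declaration first, then the Content-Type charset, then a default.
def detect_feed_encoding(body, content_type, default = 'ISO-8859-1')
  # Match on a binary copy so undecodable bytes cannot raise during the match.
  declared = body.b[/\A\s*<\?xml[^>]*encoding=["']([^"']+)["']/, 1]
  return declared.downcase if declared

  charset = content_type.to_s[/charset=["']?([\w-]+)/i, 1]
  return charset.downcase if charset

  default
end

# Hypothetical helper: label the raw bytes with the detected encoding, then
# convert to UTF-8, replacing anything that cannot be converted.
def to_utf8(body, encoding)
  body.dup.force_encoding(encoding).encode('UTF-8', invalid: :replace, undef: :replace)
end

body = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>caf\xE9</title>".b
encoding = detect_feed_encoding(body, 'text/xml')
puts encoding
puts to_utf8(body, encoding)
```

When the conditional headers are sent and the feed has not changed, the server answers 304 Not Modified (Net::HTTPNotModified in Net::HTTP) with no body, so there is nothing to convert or parse for that request.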

How to fix Internet Explorer and Firefox form encoding issues when posting data to a server having a different encoding

Tagged utf-8, iso-8859-1, internet explorer, firefox, encoding, form  Languages html

This snippet explains how to fix Internet Explorer and Firefox form encoding issues when posting data to a server having a different encoding than the source system.

This happens, for example, when you host a form on a server using ISO-8859-1 that posts data to a server using UTF-8.
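
To make the mismatch concrete, the sketch below simulates it in Ruby (1.9 or later) without a browser: a form on an ISO-8859-1 page submits one byte per character, and a server that expects UTF-8 cannot decode those bytes. The helper names are hypothetical, and the sketch assumes the browser submits in the page's own encoding unless told otherwise.

```ruby
# Hypothetical helper: the bytes a browser would post for `text`
# when the form is submitted in the given encoding.
def posted_bytes(text, form_encoding)
  text.encode(form_encoding).b
end

# Hypothetical helper: can a server that assumes UTF-8 decode these bytes?
def readable_as_utf8?(bytes)
  bytes.dup.force_encoding('UTF-8').valid_encoding?
end

text = "caf\u00E9" # "café"

# Submitted as ISO-8859-1: "é" is the single byte 0xE9, which is not valid UTF-8.
puts readable_as_utf8?(posted_bytes(text, 'ISO-8859-1')) # false

# Submitted as UTF-8: "é" is the two bytes 0xC3 0xA9, which the server can decode.
puts readable_as_utf8?(posted_bytes(text, 'UTF-8'))      # true
```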

The fix for Firefox (and Opera and other sensible browsers) is to use the accept-charset attribute:

<form ...  accept-charset="utf-8">

The fix for Internet Explorer is to use a hack:

<form ...  accept-charset="utf-8">
  <input type="hidden" name="enc" value="&#153;">
</form>

The hidden input field will make Internet Explorer understand that you want it to support UTF-8.