encoding snippets

Detecting file/data encoding with Ruby and the chardet RubyGem

Tagged detect, charset, encoding, ruby, chardet  Languages ruby

You can use the chardet gem to detect the charset of an arbitrary string.

Install the chardet gem by issuing the following command:

$ sudo gem install chardet

Then in irb:

require 'rubygems'
require 'UniversalDetector'
p UniversalDetector::chardet('Ascii text')
p UniversalDetector::chardet('åäö')

The output from this example is:

{"encoding"=>"ascii", "confidence"=>1.0}
{"encoding"=>"utf-8", "confidence"=>0.87625}

An equivalent library exists for Python users...
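
Where installing the gem is not an option, a much rougher check is possible with Ruby's standard library alone. The sketch below is not the chardet algorithm: it only distinguishes plain ASCII, valid UTF-8, and "something else", and the helper name rough_encoding_guess is hypothetical. It assumes Ruby 1.9 or later.

```ruby
# Hypothetical helper, not part of the chardet gem.
# Returns 'ascii', 'utf-8', or nil when the bytes are neither.
def rough_encoding_guess(data)
  # Work on a binary copy so the caller's string is left untouched
  bytes = data.dup.force_encoding(Encoding::BINARY)
  return 'ascii' if bytes.ascii_only?
  # Reinterpret the same bytes as UTF-8 and check that they are well-formed
  return 'utf-8' if bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  nil
end

p rough_encoding_guess('Ascii text')                 # "ascii"
p rough_encoding_guess("\xC3\xA5\xC3\xA4\xC3\xB6")   # "utf-8" (the bytes of "åäö")
p rough_encoding_guess("\xE5\xE4\xF6")               # nil (ISO-8859-1 bytes)
```

Unlike chardet, this gives no confidence value and cannot tell one single-byte encoding from another.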

How to use Ruby and SimpleRSS to parse RSS and Atom feeds

Tagged rss, atom, parse, ruby, simplerss, encoding, utf-8  Languages ruby

This script is an example of how to use the SimpleRSS gem to parse an RSS feed.

The script can easily be modified to support conditional GETs. It also detects the feed's character encoding and converts the feed to UTF-8.
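
As a rough illustration of the conditional GET idea, the sketch below sends the ETag and Last-Modified values saved from an earlier response and treats a 304 reply as "nothing new". The helper names conditional_headers and fetch_if_changed are hypothetical, and where the saved values are stored is left to the caller. The script itself follows after this sketch.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build request headers from previously saved validators.
def conditional_headers(etag, last_modified)
  headers = {}
  headers['If-None-Match'] = etag if etag
  headers['If-Modified-Since'] = last_modified if last_modified
  headers
end

# Hypothetical helper: returns [body, etag, last_modified].
# body is nil when the server answers 304 Not Modified.
def fetch_if_changed(url, etag = nil, last_modified = nil)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = http.read_timeout = 10
  response = http.get(uri.request_uri, conditional_headers(etag, last_modified))
  return [nil, etag, last_modified] if response.is_a?(Net::HTTPNotModified)
  # Save the new validators and pass them in on the next call
  [response.body, response['ETag'], response['Last-Modified']]
end
```

A caller would persist the returned ETag and Last-Modified values and hand them back on the next fetch; a nil body then means the cached copy is still current.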

require 'iconv'
require 'net/http'
require 'net/https'
require 'rubygems'
require 'simple-rss'

url = URI.parse('http://hbl.fi/rss.xml')

http = Net::HTTP.new(url.host, url.port)

http.open_timeout = http.read_timeout = 10  # Set open and read timeout to 10 seconds
http.use_ssl = (url.scheme == "https")

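headers = {
  'User-Agent'          => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
  # Placeholders: for conditional GETs, store the Last-Modified and ETag values
  # from the previous response in a database and send them back on each request.
  'If-Modified-Since'   => 'store in a database and set on each request',
  'If-None-Match'       => 'store in a database and set on each request'
}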

# Net::HTTP#get returns only the response object on Ruby 1.9 and later
response = http.get(url.path, headers)
body = response.body

encoding = body.scan(
/^<\?xml [^>]*encoding="([^\"]*)"[^>]*\?>/
).flatten.first

# encoding is nil when the feed has no XML declaration with an encoding attribute
if encoding.nil? || encoding.empty?
    if response["Content-Type"] =~ /charset=([\w\d-]+)/
        encoding = $1.downcase
        puts "Feed #{url} is #{encoding} according to Content-Type header"
    else
        puts "Unable to detect content encoding for #{url}, using default."
        encoding = "ISO-8859-1"
    end
else
    puts "Feed #{url} is #{encoding} according to XML"
end

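# Iconv was removed from the standard library in Ruby 2.0; install the iconv gem there.
# If the conversion raises an exception, use 'UTF-8//IGNORE' as the target encoding.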
ic = Iconv.new('UTF-8', encoding)
body = ic.iconv(body)

feed = SimpleRSS.parse(body)

feed.items.each do |item|
  puts item.title
end
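
On Ruby versions without Iconv, the detection and conversion steps can be approximated with String#encode. The following is a minimal sketch under that assumption, not the script's own method: the helper names detect_feed_encoding and to_utf8 are hypothetical, and String#scrub requires Ruby 2.1 or later.

```ruby
# Hypothetical helpers; names and defaults are assumptions, not part of SimpleRSS.
XML_DECL = /^<\?xml [^>]*encoding=["']([^"']*)["'][^>]*\?>/

# Prefer the XML declaration, then the Content-Type header, then the default.
def detect_feed_encoding(body, content_type = nil, default = 'ISO-8859-1')
  # Match against a binary copy so invalid bytes cannot break the regex
  raw = body.dup.force_encoding(Encoding::BINARY)
  declared = raw[XML_DECL, 1]
  return declared unless declared.nil? || declared.empty?
  charset = content_type.to_s[/charset=([\w-]+)/i, 1]
  charset ? charset.downcase : default
end

# Tag the bytes with their real encoding, then transcode to UTF-8.
# Invalid or unmappable bytes are replaced instead of raising.
def to_utf8(body, encoding)
  body.dup.force_encoding(encoding)
      .encode('UTF-8', invalid: :replace, undef: :replace)
      .scrub
end

body = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>\xE5\xE4\xF6</t>".b
encoding = detect_feed_encoding(body)
puts encoding                  # ISO-8859-1
puts to_utf8(body, encoding)
```

force_encoding raises ArgumentError for an encoding name Ruby does not know, so a real script would want to rescue that and fall back to the default.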

How to fix Internet Explorer and Firefox form encoding issues when posting data to a server having a different encoding

Tagged utf-8, iso-8859-1, internet explorer, firefox, encoding, form  Languages html

This snippet explains how to fix Internet Explorer and Firefox form encoding issues when posting data to a server having a different encoding than the source system.

This happens, for example, when you host a form on a server using ISO-8859-1 that posts data to a server using UTF-8.
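
The mismatch can be reproduced outside the browser. The Ruby sketch below assumes the form submits text in the hosting page's encoding (ISO-8859-1) and shows that the same bytes are not valid UTF-8 on the receiving side; the variable names are illustrative only.

```ruby
text = "\u00E5\u00E4\u00F6"                  # "åäö"

latin1_bytes = text.encode('ISO-8859-1').b   # what the ISO-8859-1 page would send
utf8_bytes   = text.encode('UTF-8').b        # what the UTF-8 server expects

puts latin1_bytes.bytesize                   # 3
puts utf8_bytes.bytesize                     # 6

# The receiving server reads the ISO-8859-1 bytes as if they were UTF-8
misread = latin1_bytes.dup.force_encoding('UTF-8')
puts misread.valid_encoding?                 # false
```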

The fix for Firefox (and Opera and other sensible browsers) is to use the accept-charset attribute:

<form ...  accept-charset="utf-8">

The fix for Internet Explorer is to use a hack:

<form ...  accept-charset="utf-8">
  <input type="hidden" name="enc" value="&#153;">
</form>

The hidden input field makes Internet Explorer honor the accept-charset attribute and submit the form as UTF-8.

How to parse an XML document in Go

Tagged go, xml, encoding, golang, rss, parser  Languages go

This example shows how to fetch and parse an XML feed with Go.

Save this in main_test.go:

package main

import (
    "bytes"
    "encoding/xml"
    "io/ioutil"
    "log"
    "net/http"
    "testing"

    // code.google.com/p/go-charset is no longer available. This package
    // provides an equivalent charset reader; fetch it with
    // go get golang.org/x/net/html/charset
    "golang.org/x/net/html/charset"
)

type RssFeed struct {
    XMLName xml.Name  `xml:"rss"`
    Items   []RssItem `xml:"channel>item"`
}

type RssItem struct {
    XMLName     xml.Name `xml:"item"`
    Title       string   `xml:"title"`
    Link        string   `xml:"link"`
    Description string   `xml:"description"`
    //NestedTag    string      xml:">nested>tags>"
}

// fetchURL downloads url and returns the response body.
func fetchURL(url string) []byte {
    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("unable to GET '%s': %s", url, err)
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalf("unable to read body '%s': %s", url, err)
    }
    return body
}

// parseXML decodes xmlDoc into target, which must be a pointer.
func parseXML(xmlDoc []byte, target interface{}) {
    reader := bytes.NewReader(xmlDoc)
    decoder := xml.NewDecoder(reader)
    // Fixes "xml: encoding \"windows-1252\" declared but Decoder.CharsetReader is nil"
    decoder.CharsetReader = charset.NewReaderLabel
    if err := decoder.Decode(target); err != nil {
        log.Fatalf("unable to parse XML: %s\n%s", err, xmlDoc)
    }
}

func TestParseReport(t *testing.T) {
    rssFeed := &RssFeed{}
    xmlDoc := fetchURL("https://news.ycombinator.com/rss")
    parseXML(xmlDoc, rssFeed) // rssFeed is already a pointer
    for _, item := range rssFeed.Items {
        log.Printf("%s: %s", item.Title, item.Link)
    }
}

Run the code with go test.