xml snippets

A simple and easy to use PHP XML parser

Tagged parser, xml, simple, php  Languages php

The PHP XML parser:

class XML
{
    static function parse($data, $handler, $encoding = "UTF-8")
    {
        $parser = xml_parser_create($encoding);

        xml_set_object($parser, $handler);
        
        xml_set_element_handler($parser,
            array($handler, 'start'),
            array($handler, 'end')
        );
            
        xml_set_character_data_handler(
            $parser,
            array($handler, 'content')
        );
            
        // Pass true as the final argument: $data is the complete document
        $result = xml_parse($parser, $data, true);

        if(!$result)
        {
            $error_string = xml_error_string(xml_get_error_code($parser));
            $error_line   = xml_get_current_line_number($parser);
            $error_column = xml_get_current_column_number($parser);
            
            $message = sprintf("XML error '%s' at line %d column %d", $error_string, $error_line, $error_column);
            
            // Free the parser before throwing so it is not leaked on errors
            xml_parser_free($parser);
            throw new Exception($message);
        }

        xml_parser_free($parser);
    }
}

A result handler:

class ResultHandler
{
    var $tag;

    function start ($parser, $tagName, $attributes = null)
    {
        echo "start";
        $this->tag .= $tagName; # Use .= to work around bug...
    }

    function end ($parser, $tagName)
    {
        echo "end";
        $this->tag = null;

    }

    function content ($parser, $content)
    {
        echo "$this->tag: $content" ;
    }
}

Then in your code:

$xml = "<a>bah</a>";
XML::parse($xml, new ResultHandler());

Note that PHP's XML parser splits character data at HTML/XML entities, so the content handler (not the start tag handler) is called three times for the element below, once for "really ", once for "&" and once for "  bad parser":

<data>really &amp;  bad parser</data>

I guess this is a bug... You can
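
The chunked delivery described above is not specific to PHP. As a minimal sketch, Python's expat binding can show the same behaviour and the usual remedy: collect the chunks per element and join them when the end tag arrives. The function name parse_text is made up for this example.

```python
import xml.parsers.expat


def parse_text(xml_text):
    """Return (tag, text) pairs, joining character data that arrives in chunks."""
    results = []
    buffers = []  # one list of text chunks per currently open element

    def start(name, attrs):
        buffers.append([])

    def char_data(data):
        # May be called several times for one text node, e.g. around "&amp;"
        if buffers:
            buffers[-1].append(data)

    def end(name):
        # Join the chunks only once the element is complete
        results.append((name, "".join(buffers.pop())))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return results


print(parse_text("<data>really &amp;  bad parser</data>"))
```

The same idea applies to the PHP handler: append to a content buffer in content() and only use the buffer in end().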

How to parse OPML with Ruby

Tagged ruby, xml, parse, opml  Languages ruby

This example demonstrates how to parse OPML with Ruby.

First install the gem.

gem install opml

Then run this code:

require 'pp'
require 'rubygems'
require 'opml'

opml = Opml.new(File.read('opml.xml'))
pp opml

opml.outlines[0].attributes['xml_url']
opml.outlines[0].attributes['html_url']
opml.outlines[0].attributes['title']
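
For comparison, here is a rough sketch of the same lookup using only Python's standard library. It assumes the usual OPML subscription-list attributes (xmlUrl, htmlUrl, title); the sample document and the function name feed_outlines are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real file could be loaded with ET.parse('opml.xml')
SAMPLE_OPML = """<opml version="2.0">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="Example" title="Example feed" type="rss"
             xmlUrl="https://example.com/feed.xml"
             htmlUrl="https://example.com/"/>
  </body>
</opml>"""


def feed_outlines(opml_text):
    """Return one dict per outline element that carries a feed URL."""
    root = ET.fromstring(opml_text)
    return [
        {
            "title": outline.get("title") or outline.get("text"),
            "xml_url": outline.get("xmlUrl"),
            "html_url": outline.get("htmlUrl"),
        }
        for outline in root.iter("outline")
        if outline.get("xmlUrl")
    ]


for feed in feed_outlines(SAMPLE_OPML):
    print(feed["title"], feed["xml_url"], feed["html_url"])
```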

How to parse XML with Python's built-in ElementTree parser

Tagged python, elementtree, xml, parse  Languages python

from xml.etree.ElementTree import fromstring

# xml is assumed to hold the XML document as a string
namespace = 'https://xxx.com/xxx'
element = fromstring(xml)

device = element.find('.//{%s}Device' % namespace)
detail = device.find('.//{%s}Details' % namespace)
series = device.findall('.//{%s}Series' % namespace)

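Watch out for namespaces: ElementTree matches elements by their fully qualified name, so each tag in a namespaced document must be written as {namespace}Tag, as in the find calls above.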
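
A small self-contained sketch of the namespace pitfall: a lookup without the namespace finds nothing, while passing a prefix map to find and findall works. The sample document and the prefix "x" are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The prefix "x" is arbitrary; only the namespace URI has to match the document
NS = {"x": "https://xxx.com/xxx"}

doc = """<Root xmlns="https://xxx.com/xxx">
  <Device>
    <Details>d1</Details>
    <Series>a</Series>
    <Series>b</Series>
  </Device>
</Root>"""

root = ET.fromstring(doc)

# Without the namespace the element is not found
print(root.find(".//Device"))

# With a prefix map the path stays readable
device = root.find(".//x:Device", NS)
details = device.find("x:Details", NS)
series = [s.text for s in device.findall("x:Series", NS)]
print(device.tag, details.text, series)
```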

How to remove text between a tag from XML or HTML with SED

Tagged sed, html, xml, xmlstarlet  Languages bash

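This removes the users element: the opening tag, the closing tag and everything in between. Because sed works on whole lines, it assumes the opening and closing tags each sit on their own line: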

sed -i .bak '/<users type="array">/,/<\/users>/d' users.xml

A backup will be created named users.xml.bak. With GNU sed, write -i.bak (no space) instead of -i .bak.

To print only specific elements instead, use this:

sed -n -e '/<private-parts>/,/<\/private-parts>/p' users.xml

For more advanced XML processing use:

  • XMLStarlet
  • xml-coreutils
  • xml2/2xml
  • Your imagination.
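
When the tags do not sit neatly on their own lines, a real parser is safer than sed. The following is a minimal sketch using Python's ElementTree; the function name remove_elements and the sample document are made up for illustration, and text that trails a removed element is dropped along with it.

```python
import xml.etree.ElementTree as ET


def remove_elements(xml_text, tag):
    """Return xml_text with every element named tag removed, wherever it occurs."""
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so that removing children does not disturb iteration
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample = '<export><users type="array"><user>a</user></users><groups/></export>'
print(remove_elements(sample, "users"))
```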

How to parse huge XML files with Ruby and Nokogiri (without using too much RAM)

Tagged big, xml, nokogiri, parse, huge  Languages ruby

Parse huge XML files, without using too much RAM, with Ruby and Nokogiri and the following code:

require 'nokogiri'
#  Public: BigXML helps you parse XML efficiently with minimal RAM usage, whether the file is 1GB, 2GB or 100GB.
#
#  Examples:
#    # Filter an XML file efficiently by selecting only users, groups and messages.
#    File.open(ARGV[1], 'w') do |out_file|
#      xml = BigXML.new(ARGV[0])
#      xml.each_node do |node, path|
#        # users
#        if node.name == 'user' # or use the element's path: path == 'export/users/user'
#          out_file << node.outer_xml
#        # groups
#        elsif node.name == 'group' # or use the element's path and content: path == 'export/groups/group' && node.outer_xml.match(/<private type="boolean">false/m)
#          out_file << node.outer_xml
#        # messages
#        elsif node.name == 'message' # or use the element's path and content: path == 'export/messages/message' && node.outer_xml.match(/<private type="boolean">false/m)
#          out_file << node.outer_xml
#        end
#      end
#    end
#
class BigXML
  # Public: Initializes a parser.
  #
  # xml_file - The path of the XML file you want to parse
  def initialize(xml_file)
    raise ArgumentError, "Please provide the path of the XML file, not a #{xml_file.class}" unless xml_file.is_a?(String)
    @xml_file = xml_file
  end

  # Public: Iterate over each node in the XML document.
  #
  # attributes_in_path - Default false. Setting this to true will include attributes in the node path, e.g. /groups/@id=1. instead of just /groups
  #
  # Yields the node (Nokogiri::XML::Reader) and path (String) of the current XML node.
  #
  # Returns nothing.
  def each_node(attributes_in_path=false)
    reader = Nokogiri::XML::Reader(File.open(@xml_file))
    nodes = ['']
    reader.each do |node|
      # start tag
      if node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
        # store path
        if attributes_in_path && node.attributes.size > 0
          attributes = []
          node.attributes.sort.each do |name, value|
            attributes << "@#{name}=#{value}"
          end
          nodes << "#{node.name}/#{attributes.join('/')}"
        else
          nodes << node.name
        end
        path = nodes.join('/')
        yield node, path
      end
      # end tag
      if node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT || node.self_closing?
        nodes.pop
      end
    end
  end
end

if __FILE__ == $0
  require 'minitest/unit'
  class TestBigXML < MiniTest::Unit::TestCase
    def test_grep
      xml = BigXML.new(ARGV[0])
      users = 0
      groups = 0
      messages = 0
      File.open(ARGV[1], 'w') do |out_file|
        xml.each_node do |node, path|
          # users
          if node.name == 'user' && path == '/export/users/user'
            users += 1
            out_file << node.outer_xml
            out_file << "\n"
          # groups
          elsif node.name == 'group' && path == '/export/groups/group'
            doc = Nokogiri::XML.parse(node.outer_xml)
            group = doc.at('/group/private')
            is_public = group && group.inner_text == 'false'
            if is_public
              groups += 1
              out_file << node.outer_xml
              out_file << "\n"
            end
          # messages
          elsif node.name == 'message' && path == '/export/messages/message'
            doc = Nokogiri::XML.parse(node.outer_xml)
            group = doc.at('/message/group/private')
            is_public = group && group.inner_text == 'false'
            if is_public
              messages += 1
              out_file << node.outer_xml
              out_file << "\n"
            end
          end
        end
        assert_equal 100, users
        assert_equal 100, groups
        assert_equal 100, messages
      end
    end
  end
  MiniTest::Unit.autorun
end

BigXML on Github
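
The same streaming idea can be sketched in Python with ElementTree's iterparse, which also reads the document incrementally. This is an illustration, not a port of BigXML: the function name each_element and the sample data are assumptions, and cleared elements still leave empty stubs attached to their parents, so memory is reduced rather than constant.

```python
import io
import xml.etree.ElementTree as ET


def each_element(source, wanted):
    """Yield (path, element) for every element whose tag is in wanted."""
    path = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            if elem.tag in wanted:
                yield "/" + "/".join(path), elem
                # Drop the element's children and text once the caller is done
                elem.clear()
            path.pop()


# A file path or any binary file object works as source
data = io.BytesIO(
    b"<export><users><user>a</user><user>b</user></users>"
    b"<groups><group>g</group></groups></export>"
)
found = [(p, e.text) for p, e in each_element(data, {"user", "group"})]
print(found)
```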

Sanitizing XML with Go's UnmarshalXML

Tagged unmarshal, go, xml  Languages go

Use a custom type to avoid this error when unmarshalling XML with Go:

strconv.ParseInt: parsing "86148865.00": invalid syntax

Example:

import (
    "encoding/xml"
    "strconv"
    "strings"
)

type DogHouse struct {
    //Count int `xml:"dog_house>count"` << this code will fail with "invalid syntax"
    Count sanitizedInt `xml:"dog_house>count"`
}

type sanitizedInt int

func (si *sanitizedInt) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
    var value string
    // Read tag content into value
    if err := d.DecodeElement(&value, &start); err != nil {
        return err
    }
    // Remove "crap" and convert to int64
    i, err := strconv.ParseInt(strings.Replace(value, "crap", "", -1), 0, 64)
    if err != nil {
        return err
    }
    // Cast int64 to sanitizedInt
    *si = (sanitizedInt)(i)
    return nil
}
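
For comparison, a small Python sketch of the same sanitizing step. It uses a different technique from the Go snippet's string replacement: the text is parsed as a decimal and then truncated to an integer. The helper name sanitized_int and the sample document are assumptions for this example.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET


def sanitized_int(text):
    """Convert numeric text such as "86148865.00" to an int (truncates fractions)."""
    return int(Decimal(text.strip()))


root = ET.fromstring("<dog_house><count>86148865.00</count></dog_house>")
count = sanitized_int(root.findtext("count"))
print(count)
```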

How to parse an XML document in Go

Tagged parser, encoding, golang, rss, go, xml  Languages go

This example shows how to fetch and parse an XML feed with Go.

Save this in main_test.go:

package main

import (
    "bytes"
    "encoding/xml"
    "io"
    "log"
    "net/http"
    "testing"

    // Replaces the defunct code.google.com/p/go-charset package;
    // install with: go get golang.org/x/net/html/charset
    "golang.org/x/net/html/charset"
)

type RssFeed struct {
    XMLName xml.Name  `xml:"rss"`
    Items   []RssItem `xml:"channel>item"`
}

type RssItem struct {
    XMLName     xml.Name `xml:"item"`
    Title       string   `xml:"title"`
    Link        string   `xml:"link"`
    Description string   `xml:"description"`
    //NestedTag    string      `xml:">nested>tags>"`
}

// fetchURL downloads url and returns the response body.
func fetchURL(url string) []byte {
    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("unable to GET '%s': %s", url, err)
    }
    defer resp.Body.Close()
    // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+)
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalf("unable to read body '%s': %s", url, err)
    }
    return body
}

// parseXML decodes xmlDoc into target, which must be a pointer.
func parseXML(xmlDoc []byte, target interface{}) {
    reader := bytes.NewReader(xmlDoc)
    decoder := xml.NewDecoder(reader)
    // Fixes "xml: encoding \"windows-1252\" declared but Decoder.CharsetReader is nil"
    decoder.CharsetReader = charset.NewReaderLabel
    if err := decoder.Decode(target); err != nil {
        log.Fatalf("unable to parse XML '%s':\n%s", err, xmlDoc)
    }
}

func TestParseReport(t *testing.T) {
    rssFeed := &RssFeed{}
    xmlDoc := fetchURL("https://news.ycombinator.com/rss")
    // rssFeed is already a pointer, so pass it as is
    parseXML(xmlDoc, rssFeed)
    for _, item := range rssFeed.Items {
        log.Printf("%s: %s", item.Title, item.Link)
    }
}

Run the code with go test.
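
To make the channel>item mapping concrete, here is a rough Python equivalent that parses an RSS document held in a string. The sample feed and the function name parse_items are invented for this sketch, and fetching over HTTP is left out so that it runs offline.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; a real one would be downloaded first
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>https://example.com/1</link><description>One</description></item>
<item><title>Second</title><link>https://example.com/2</link><description>Two</description></item>
</channel></rss>"""


def parse_items(rss_text):
    """Return (title, link) for every item below the channel element."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.findall("./channel/item")
    ]


for title, link in parse_items(SAMPLE_RSS):
    print("%s: %s" % (title, link))
```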

How to sign XML documents using XMLDSig (XML Signature)

Tagged signature, xmldsig, xml, wtf, hl7fi, kanta, xmlsec1  Languages bash, xml

Install xmlsec1

sudo apt-get install xmlsec1

Create document

<?xml version="1.0" encoding="UTF-8"?>
<document>
  <hello>All XML is doomed to fail.</hello>
  <!-- Signature contains the signature definition -->
  <Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
    <SignedInfo>
      <CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
      <SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/>
      <Reference>
        <Transforms>
          <Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/>
          <Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
        </Transforms>
        <DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
        <DigestValue />
      </Reference>
    </SignedInfo>
    <SignatureValue />
    <KeyInfo>
      <X509Data />
    </KeyInfo>
  </Signature>
</document>

Sign document

xmlsec1 --sign --privkey-pem xxx.com.key,xxx.com.cer --output signed.xml tosign.xml

This example uses test certificates issued by VRK.

Verify document

xmlsec1 --verify --trusted-pem vrkthsp.pem --trusted-pem vrktestc.pem signed.xml

Note that a concatenated PEM file, i.e. cat vrkthsp.pem vrktestc.pem > concat.pem, does not work with xmlsec1.

How to specify which elements to sign with ds:Reference

Add one or more ds:Reference elements to specify which elements should be signed. Each ds:Reference points at its target through the URI attribute, which holds the target element's unique ID prefixed with a hash, e.g., #your-id:

<ds:Reference URI="#secret-xml-sauce">

Make sure your document contains an element having the exact ID without the hash prefix:

<Dog ID="secret-xml-sauce" name="Christian" />

Next, use the “--id-attr” switch to specify the element and attribute name:

xmlsec1 --sign --privkey-pem signing.key,signing.pem --id-attr:ID Dog --id-attr:ID structuredBody --output signed.xml tosign.xml

Note that “id” is the default attribute name. You only need the --id-attr switch if the ID is stored in an attribute with a different name.

How to sign multiple elements

Just add another “--id-attr:” switch:

xmlsec1 --sign --privkey-pem signing.key,signing.pem --id-attr:ID signatureTimestamp --id-attr:ID structuredBody --output signed.xml tosign.xml

Then add another element having the given ID.
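
Before signing, it can help to check that every ds:Reference actually resolves to an element. The sketch below does this with Python's ElementTree. It is not part of xmlsec1; the function name unresolved_references and the sample document are assumptions, and only same-document URIs of the form #id are checked.

```python
import xml.etree.ElementTree as ET

DS = "http://www.w3.org/2000/09/xmldsig#"


def unresolved_references(xml_text, id_attr="ID"):
    """Return the Reference URIs that match no element's id_attr value."""
    root = ET.fromstring(xml_text)
    # Attribute names are case sensitive: "ID" and "Id" are different attributes
    ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None}
    missing = []
    for ref in root.iter("{%s}Reference" % DS):
        uri = ref.get("URI", "")
        if uri.startswith("#") and uri[1:] not in ids:
            missing.append(uri)
    return missing


doc = (
    '<document xmlns:ds="http://www.w3.org/2000/09/xmldsig#">'
    '<Dog ID="secret-xml-sauce" name="Christian"/>'
    "<ds:Signature><ds:SignedInfo>"
    '<ds:Reference URI="#secret-xml-sauce"/>'
    '<ds:Reference URI="#missing"/>'
    "</ds:SignedInfo></ds:Signature>"
    "</document>"
)
print(unresolved_references(doc))
```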

Troubleshooting

  • This error means you don’t have the correct trusted PEM certificate:

certificate issuer check failed:err=2;msg=unable to get issuer certificate;issuer=/C=FI/ST=Finland/O=Vaestorekisterikeskus TEST/OU=Certification Authority Services/OU=Varmennepalvelut/CN=VRK TEST Root CA

To fix the error, take a hard look at the Issuer and Subject of all certificates in the certificate chain. For example:

openssl x509 -inform DER -in vrktestc.crt -text | grep "Issuer\|Subject"

  • xmlsec1 fails to find the element containing the ID:

func=xmlSecXPathDataExecute:file=xpath.c:line=273:obj=unknown:subj=xmlXPtrEval:error=5:libxml2 library function failed:expr=xpointer(id('ID_OF_ELEMENT_TO_SIGN'))

Note that XPath queries are case sensitive. This means you might have to specify both the name of the element and the name of the ID attribute like this:

xmlsec1 --sign --id-attr:ID elementThatYouWantToSign ...

For more solutions to issues, see sgros.blogspot.com: http://sgros.blogspot.com/2013/01/signing-xml-document-using-xmlsec1.html

Storing, querying, and indexing XML with Postgres

Tagged postgres, xbrl, xml, xpath  Languages sql, xml

Create table

CREATE TABLE xbrl_reports
(
  id serial primary key NOT NULL,
  doc xml NOT NULL,
  cik varchar(255) NOT NULL
);

Create function for importing XML

-- http://tapoueh.org/blog/2009/02/05-importing-xml-content-from-file
create or replace function xml_import(filename text)
  returns xml
  volatile
  language plpgsql as
$f$
    declare
        content bytea;
        loid oid;
        lfd integer;
        lsize integer;
    begin
        loid := lo_import(filename);
        -- 262144 = INV_READ (0x40000): open the large object read-only
        lfd := lo_open(loid,262144);
        lsize := lo_lseek(lfd,0,2);
        perform lo_lseek(lfd,0,0);
        content := loread(lfd,lsize);
        perform lo_close(lfd);
        perform lo_unlink(loid);
 
        return xmlparse(document convert_from(content,'UTF8'));
    end;
$f$;

Import XML

-- Import XML file into Postgres
insert into xbrl_reports(doc, cik) values(xml_import('/Users/Christian/Downloads/2016q3/ibm-20160930.xml'), '1');

XML namespaces:

<xbrl
  xmlns="http://www.xbrl.org/2003/instance"
  xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31"

Query data

-- Check if dei:TradingSymbol exists => t
SELECT xpath_exists('//xbrl:xbrl/dei:TradingSymbol', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}') from xbrl_reports;

-- Extract dei:TradingSymbol by declaring dei namespace => {IBM}
SELECT xpath('//xbrl:xbrl/dei:TradingSymbol/text()', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}') from xbrl_reports;

-- Extract dei:TradingSymbol by adding ((...)[1]::text) => IBM
SELECT ((xpath('//xbrl:xbrl/dei:TradingSymbol/text()', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}'))[1]::text) from xbrl_reports;

Index data

-- Create index for faster lookups
create index xbrl_reports_ticker_idx on xbrl_reports using btree ((( xpath('//xbrl:xbrl/dei:TradingSymbol/text()', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}') )[1]::text)); 

Materialized views for performance

CREATE MATERIALIZED VIEW company_reports AS SELECT
  ((xpath('//xbrl:xbrl/dei:TradingSymbol/text()', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}'))[1]::text) as ticker,
  ((xpath('//xbrl:xbrl/dei:EntityRegistrantName/text()', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}'))[1]::text) as name,
  ((xpath('//xbrl:xbrl/dei:DocumentType/text()', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}'))[1]::text) as document_type,
  ((xpath('//xbrl:xbrl/dei:DocumentPeriodEndDate/text()', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}'))[1]::text) as quarter,
  ((xpath('//xbrl:xbrl/dei:EntityCommonStockSharesOutstanding/text()', doc, '{{xbrl,http://www.xbrl.org/2003/instance},{dei,http://xbrl.sec.gov/dei/2014-01-31}}'))[1]::text) as shares_outstanding
FROM xbrl_reports;