Understanding Encoding (Trying To, At Least)

March 3rd, 2014 | 2 Comments | Posted in Perl, XML

Earlier this evening there was a presentation at the San Francisco PM group, on “Unicode & Everything”. I wanted to go, but I had a conflict and had to miss it.

Character encoding is an area I’m weak in, and one that I need to be better at. My biggest module, RPC::XML, supposedly supports encodings other than us-ascii but in truth it’s pretty broken. I recently applied a patch that fixes the handling of UTF-8 content, but that’s not what I need. What I need is for it to properly handle content in (theoretically) any encoding. I don’t think that a talk focused on Unicode would have covered that, but I was hoping that I might be able to corner the speaker afterwards to bounce some questions off of him.

What it comes down to is this: my library creates requests and responses in well-formed XML, complete with an encoding attribute in the XML declaration:

<?xml version="1.0" encoding="us-ascii"?>
<methodCall>
    <methodName>someName</methodName>
    <params>
        <param>
            <value><string>Some string data</string></value>
        </param>
        <param>
            <value><int>42</int></value>
        </param>
    </params>
</methodCall>

What matters here is not the structure of the XML (in this case); it’s the encoding="us-ascii" part, and this line:

<value><string>Some string data</string></value>

See, my library generates the XML around the “Some string data”, but the string data itself comes from whatever the user provides, and the user expects that to be in the encoding he or she specified. And here is where I start to get confused: I know that the boilerplate code is US-ASCII (in the range that makes it passable as UTF-8), but I suspect that I can’t just paste in a string encoded in ShiftJIS and slap on encoding="shiftjis" in the XML declaration. Or can I?

XML-RPC has a very limited vocabulary and set of data types. The character range, funny-encoded strings notwithstanding, is just basic ASCII. You have the tags, then strings, integers, doubles, date/time values (ISO8601) and base64 data. Regardless of encoding, all of these except the strings stay in the ASCII range.

So for those reading this who are more adept at working with encodings than I am: how should I approach this? Is the magic sauce somewhere in Perl’s Encode module? I really want to get this part of the RPC::XML module working right, so I can move on to the next big hassle, data compression…
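One possible approach, sketched here with the core Encode module, is to build the whole message as a Perl character string and encode it exactly once at the end, so the boilerplate and the user's string always end up in the same encoding. This is only a minimal sketch under assumptions: build_request is a hypothetical helper, not part of RPC::XML, and it assumes the user's string has already been decoded into Perl characters.

```perl
use strict;
use warnings;
use utf8;                       # the string literals below contain non-ASCII characters
use Encode qw(encode decode);

# Hypothetical helper (not part of RPC::XML): assemble the request as
# characters, then encode the entire document in one step.
sub build_request {
    my ($encoding, $method, $text) = @_;

    # Escape the XML-significant characters in the user's string.
    my %escape = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    (my $escaped = $text) =~ s/([&<>])/$escape{$1}/g;

    my $xml = qq{<?xml version="1.0" encoding="$encoding"?>\n}
            . qq{<methodCall><methodName>$method</methodName><params>}
            . qq{<param><value><string>$escaped</string></value></param>}
            . qq{</params></methodCall>\n};

    # FB_CROAK makes encode() die if a character cannot be represented
    # in the target encoding, instead of silently substituting one.
    return encode($encoding, $xml, Encode::FB_CROAK);
}

# The result is a byte string in the requested encoding.
my $bytes = build_request('shiftjis', 'someName', 'こんにちは');
```

Because the declaration and the payload pass through the same encode() call, they cannot disagree. A receiving parser may expect the IANA-registered name Shift_JIS in the declaration rather than Encode's own name shiftjis, so the name written into the XML may need to differ from the one handed to Encode.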

(I also need to figure out why my code-highlighting plugin isn’t doing its job…)

(I think I got it now. Something was missing from one of the theme files. Gotta love WordPress/PHP…)


Idle Thoughts on Parsing XML (slightly Perlish)

October 7th, 2009 | No Comments | Posted in Perl, XML

(Side note: There was no Module Monday post this week, as I was too swamped to look for one to cover. Check back next week…)

I’m in the (achingly slow) process of writing a new XML-RPC parser using XML::LibXML. Because its SAX support is spotty (according to the module’s own docs), I’m letting the library parse the whole message into a DOM object and then using that object to get the request or response. This has proven to be a serious pain in the lower regions.

The XML::Parser approach I’ve had since RPC::XML’s inception is an event-based parser: I use a state-machine/stack approach and push/pop items as needed, based on whether my event is a tag-start, tag-end, text, etc. As a side effect, I validate the document, since the stack/state machine will throw an exception if some event doesn’t fit what it is expecting.
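To make the stack idea concrete, here is a minimal sketch in plain Perl. It is not the actual RPC::XML code: the %allowed table covers only a simplified subset of XML-RPC, validate_events is a hypothetical name, and the events are fed in by hand as [type, data] pairs rather than coming from XML::Parser callbacks.

```perl
use strict;
use warnings;

# Which child elements each element may contain (a simplified subset of
# XML-RPC, for illustration only). The key '' means "outside the root".
my %allowed = (
    ''         => { methodCall => 1 },
    methodCall => { methodName => 1, params => 1 },
    methodName => {},
    params     => { param => 1 },
    param      => { value => 1 },
    value      => { string => 1, int => 1 },
    string     => {},
    int        => {},
);

# Hypothetical validator: walks the events in order and dies on the
# first one that does not fit the current top of the stack.
sub validate_events {
    my @events = @_;
    my @stack  = ('');
    for my $event (@events) {
        my ($type, $data) = @$event;
        if ($type eq 'start') {
            die "<$data> not allowed inside <$stack[-1]>\n"
                unless $allowed{ $stack[-1] }{$data};
            push @stack, $data;
        }
        elsif ($type eq 'end') {
            die "unexpected </$data>\n" unless $stack[-1] eq $data;
            pop @stack;
        }
        elsif ($type eq 'text') {
            # In this sketch, non-whitespace text is only legal in leaf elements.
            die "text not allowed inside <$stack[-1]>\n"
                if $data =~ /\S/ && %{ $allowed{ $stack[-1] } };
        }
    }
    die "document ended with open elements\n" if @stack > 1;
    return 1;
}
```

Validation falls out of the bookkeeping: an element is pushed only if its parent allows it, so a misplaced tag raises an exception the moment its start event arrives.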

Taking a DOM approach means more work: not only am I drilling down for the data I need, I also have to check for validity myself. (Some might point out that XML::LibXML supports checking document validity against a DTD, an XML Schema or a RelaxNG schema, and I’m familiar with that. But there is no “real” (i.e., “official”) DTD or schema for XML-RPC for me to use in this case.)
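As a rough illustration of that extra work, here is a minimal DOM-style sketch using XML::LibXML. It is an assumption-laden example rather than the parser described in this post: parse_request is a hypothetical helper, it handles requests only, and it performs just a few of the structural checks a real parser would need.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: pull the method name and the parameters out of a
# request, checking the structure by hand along the way.
sub parse_request {
    my ($xml) = @_;
    my $doc  = XML::LibXML->load_xml(string => $xml);
    my $root = $doc->documentElement;
    die "root must be <methodCall>\n" unless $root->nodeName eq 'methodCall';

    my @names = $root->findnodes('./methodName');
    die "expected exactly one <methodName>\n" unless @names == 1;

    my @args;
    for my $value ($root->findnodes('./params/param/value')) {
        # Each <value> should wrap exactly one typed element.
        my @typed = $value->findnodes('./*');
        die "expected one typed element per <value>\n" unless @typed == 1;
        push @args, {
            type => $typed[0]->nodeName,
            data => $typed[0]->textContent,
        };
    }
    return { method => $names[0]->textContent, args => \@args };
}
```

Every check here is one the event-based stack performed for free. Anything not tested explicitly, such as a stray element sitting next to params, slips through unnoticed.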

So here’s my observation, which is probably blindingly obvious to everyone else who’s worked with XML: SAX/event-based parsing is the way to go for processing a whole document, and DOM is better for cherry-picking pieces from different parts of it.

Like I said, probably pretty obvious to the rest of you, but it’s hitting me over the head pretty hard these days.


OSCON Day 2, Part 1: Lightning Talks

July 23rd, 2009 | No Comments | Posted in ChangeLogML, Conferences, Perl, XML, XSLT

I will be giving a lightning talk (possibly two) in the Perl Lightning Talks session at 4:30. The talk I am definitely doing is on ChangeLogML and the Perl module I have for it. If their time and schedule permit, I will also give a talk on my testing module, Test::Formats.
