
Understanding Encoding (Trying To, At Least)

March 3rd, 2014 Posted in Perl, XML

Earlier this evening there was a presentation at the San Francisco PM group on “Unicode & Everything”. I wanted to go, but I had a conflict and had to miss it.

Character encoding is an area I’m weak in, and one that I need to be better at. My biggest module, RPC::XML, supposedly supports encodings other than us-ascii but in truth it’s pretty broken. I recently applied a patch that fixes the handling of UTF-8 content, but that’s not what I need. What I need is for it to properly handle content in (theoretically) any encoding. I don’t think that a talk focused on Unicode would have covered that, but I was hoping that I might be able to corner the speaker afterwards to bounce some questions off of him.

What it comes down to is this: my library creates requests and responses in well-formed XML, complete with an encoding attribute in the XML declaration:

<?xml version="1.0" encoding="us-ascii"?>
<methodCall>
    <methodName>someName</methodName>
    <params>
        <param>
            <value><string>Some string data</string></value>
        </param>
        <param>
            <value><int>42</int></value>
        </param>
    </params>
</methodCall>

What matters here is not the structure of the XML (in this case); it’s the encoding="us-ascii" part, and this line:

<value><string>Some string data</string></value>

See, my library generates the XML around the “Some string data”, but the string data itself comes from whatever the user provides, and the user expects that to be in the encoding he or she specified. And here is where I start to get confused: I know that the boilerplate code is US-ASCII (in the range that makes it passable as UTF-8), but I suspect that I can’t just paste in a string encoded in ShiftJIS and slap on encoding="shiftjis" in the XML declaration. Or can I?
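One way to probe the “Or can I?” question is to check whether the boilerplate's octets actually change from one encoding to another. The following is only a minimal sketch using the core Encode module, not code from RPC::XML; the `%same` hash and the choice of encodings to compare are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Markup taken from the request above; it contains only ASCII characters.
my $boilerplate = '<value><string></string></value>';

# Encode the same characters under several encodings and compare the
# resulting octets with the us-ascii octets.
my $ascii_octets = encode('us-ascii', $boilerplate);

my %same;
for my $name (qw(UTF-8 shiftjis UTF-16)) {
    my $octets = encode($name, $boilerplate);
    $same{$name} = $octets eq $ascii_octets ? 1 : 0;
    printf "%-8s %s\n", $name,
        $same{$name} ? 'same octets as us-ascii' : 'different octets';
}
```

With the stock Encode distribution this should report identical octets for UTF-8 and shiftjis and different octets for UTF-16. That suggests byte-level pasting could only ever work for encodings that leave ASCII untouched, and even then only if the user's string really is octets in the declared encoding.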

XML-RPC has a very limited vocabulary and set of data-types. The character range, funny-encoded-strings notwithstanding, is just basic ASCII. You have the tags, then strings, integers, doubles, date/time values (ISO8601) and base-64 data. Regardless of encoding, all of these except the strings stay in the ASCII range.

So for those reading this who are more adept at working with encodings than I am: how should I approach this? Is the magic sauce somewhere in Perl’s Encode module? I really want to get this part of the RPC::XML module working right, so I can move on to the next big hassle, data compression…

(I also need to figure out why my code-highlighting plugin isn’t doing its job…)

(I think I got it now. Something was missing from one of the theme files. Gotta love WordPress/PHP…)


2 Responses to “Understanding Encoding (Trying To, At Least)”

  1. chansen Says:

    There is no magic sauce! From a quick scan of the documentation, I couldn’t find any mention of encodings on the “producing” side. Based on that, here is what I would do:

    1) Document that you accept strings (sequences of characters) and not octets (encoded strings) in your string API. (In Perl we can’t make a distinction between characters and octets.)
    2) Let the user optionally provide the desired encoding.
    3) If the user provided an encoding, use Encode::find_encoding($name), encode the document, and adjust the XML declaration appropriately.
    4) If the user didn’t provide an explicit encoding, use UTF-8 and adjust the XML declaration appropriately.

    Regards
    chansen
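The four steps chansen lists could be sketched roughly as follows. This is an illustration under stated assumptions, not RPC::XML's actual API: `build_request` is a hypothetical helper, the method name is borrowed from the example in the post, and only the Encode calls (`find_encoding`, `mime_name`, `name`, `encode`, `FB_CROAK`) come from the real module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of RPC::XML. Step 1: $text must be a
# character string, not already-encoded octets.
sub build_request {
    my ($text, $encoding_name) = @_;

    # Steps 2 and 4: the encoding is optional and defaults to UTF-8.
    $encoding_name = 'UTF-8' unless defined $encoding_name;

    # Step 3: look up the encoding object; find_encoding returns undef
    # for names it does not recognize.
    my $enc = Encode::find_encoding($encoding_name)
        or die "Unknown encoding: $encoding_name\n";

    # Escape the XML metacharacters in the user's data.
    (my $escaped = $text) =~ s/&/&amp;/g;
    $escaped =~ s/</&lt;/g;
    $escaped =~ s/>/&gt;/g;

    # Prefer the MIME name for the declaration, else Encode's own name.
    my $declared = $enc->mime_name || $enc->name;

    my $xml = qq{<?xml version="1.0" encoding="$declared"?>\n}
            . '<methodCall><methodName>someName</methodName><params>'
            . "<param><value><string>$escaped</string></value></param>"
            . "</params></methodCall>\n";

    # Encode the whole document in one pass; FB_CROAK makes encode die on
    # characters the target encoding cannot represent.
    return $enc->encode($xml, Encode::FB_CROAK());
}

# "cafe" with an acute accent, using the UTF-8 default.
my $utf8_octets = build_request("caf\x{e9}");

# Three kanji (U+65E5 U+672C U+8A9E), encoded as Shift-JIS on request.
my $sjis_octets = build_request("\x{65e5}\x{672c}\x{8a9e}", 'shiftjis');
```

Encoding the finished document in one pass means the boilerplate and the user's string always end up in the same encoding, which is the property the XML declaration promises.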


  2. Joseph Brenner Says:

    Do you understand that UTF-8 is entirely backwards compatible with 7-bit ASCII? That’s one of the nice things about it. If you changed your encoding to UTF-8 you shouldn’t notice any change at all in anything except what people can get away with embedding in the XML as strings.

    And the way that Perl handles encodings is at the boundaries of the program. There’s one internal encoding in any Perl program: if you’re passed a string then it’s just a string, even if the client programmer decoded it from Shift-JIS. Your problem is just encoding output as UTF-8.
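The boundary idea in this comment can be illustrated with a short sketch: decode octets on the way in, work with characters inside, encode on the way out. The sample octets and variable names are assumptions chosen for the example; `decode` and `encode` are the standard Encode functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from outside the program: the two kanji
# U+65E5 U+672C in Shift-JIS (two octets per character).
my $sjis_octets = "\x93\xfa\x96\x7b";

# Inbound boundary: decode octets into a character string.
my $chars = decode('shiftjis', $sjis_octets);

# Inside the program the string is just characters, so length counts
# characters rather than octets.
my $char_count = length $chars;

# Outbound boundary: encode the characters as UTF-8 octets (three octets
# per character for these code points).
my $utf8_octets = encode('UTF-8', $chars);
my $octet_count = length $utf8_octets;

printf "%d characters, %d UTF-8 octets\n", $char_count, $octet_count;
```

Here the library code in the middle never needs to know that the data started out as Shift-JIS; only the two boundary calls mention an encoding at all.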

