Earlier this evening there was a presentation at the San Francisco PM group on “Unicode & Everything”. I wanted to go, but had a conflict and had to miss it.
Character encoding is an area I’m weak in, and one that I need to be better at. My biggest module, RPC::XML, supposedly supports encodings other than us-ascii, but in truth that support is pretty broken. I recently applied a patch that fixes the handling of UTF-8 content, but that’s not what I need. What I need is for it to properly handle content in (theoretically) any encoding. I don’t think that a talk focused on Unicode would have covered that, but I was hoping that I might be able to corner the speaker afterwards and bounce some questions off him.
What it comes down to is this: my library creates requests and responses in well-formed XML, complete with an encoding attribute in the declaration line:
<?xml version="1.0" encoding="us-ascii"?>
<methodCall>
  <methodName>someName</methodName>
  <params>
    <param>
      <value><string>Some string data</string></value>
    </param>
    <param>
      <value><int>42</int></value>
    </param>
  </params>
</methodCall>
What matters here is not the structure of the XML (in this case); it’s the encoding="us-ascii" part, and this line:
<value><string>Some string data</string></value>
See, my library generates the XML around the “Some string data”, but the string data itself comes from whatever the user provides, and the user expects that to be in the encoding he or she specified. And here is where I start to get confused: I know that the boilerplate code is US-ASCII (in the range that makes it passable as UTF-8), but I suspect that I can’t just paste in a string encoded in ShiftJIS and slap on encoding="shiftjis" in the XML declaration. Or can I?
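One way to frame the question, as far as I understand it: treat the user's string as characters rather than bytes, build the whole document as text, and encode the entire thing once into the declared encoding. The sketch below is in Python rather than Perl, purely as a language-neutral illustration. `build_request` is a hypothetical helper, not anything in RPC::XML, and it assumes the caller hands over already-decoded text. In Perl the analogous step would presumably go through `Encode::encode`.

```python
from xml.sax.saxutils import escape


def build_request(method_name, text, encoding="us-ascii"):
    """Hypothetical helper: build a one-string XML-RPC call as bytes."""
    # Work with characters first: escape &, < and > in the user's data.
    document = (
        f'<?xml version="1.0" encoding="{encoding}"?>\n'
        "<methodCall>"
        f"<methodName>{escape(method_name)}</methodName>"
        "<params><param>"
        f"<value><string>{escape(text)}</string></value>"
        "</param></params>"
        "</methodCall>"
    )
    # Encode the whole document once, so the declaration and the bytes agree.
    # Characters the target encoding cannot represent become numeric
    # character references (e.g. &#26085;) instead of raising an error.
    return document.encode(encoding, errors="xmlcharrefreplace")


request = build_request("someName", "日本語", encoding="shift_jis")
print(request)  # bytes: ASCII tags with Shift_JIS string data in between
```

In this sketch the tags come out as the same bytes they would be in US-ASCII, and only the string payload differs between encodings. One detail to keep in mind is that the registered name of the encoding is Shift_JIS, so a strict parser may not recognize other spellings in the declaration.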
XML-RPC has a very limited vocabulary and set of data types. The character range, funny-encoded strings notwithstanding, is just basic ASCII. You have the tags, then strings, integers, doubles, date/time values (ISO 8601) and base64 data. Regardless of encoding, all of these except the strings stay in the ASCII range.
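To make that claim concrete, here is a small Python sketch (again only an illustration, not how RPC::XML does it) that serializes each non-string type and checks that the result is pure ASCII. `scalar_to_xmlrpc` is a made-up name, and using `repr` for doubles is a simplification of mine: it can produce exponent notation for very large or small values, which is still ASCII but may not suit every XML-RPC peer.

```python
import base64
from datetime import datetime


def scalar_to_xmlrpc(value):
    """Hypothetical helper: serialize one non-string XML-RPC scalar."""
    if isinstance(value, int) and not isinstance(value, bool):
        return f"<value><int>{value}</int></value>"
    if isinstance(value, float):
        return f"<value><double>{value!r}</double></value>"
    if isinstance(value, datetime):
        stamp = value.strftime("%Y%m%dT%H:%M:%S")
        return f"<value><dateTime.iso8601>{stamp}</dateTime.iso8601></value>"
    if isinstance(value, bytes):
        # Base64 turns arbitrary bytes into ASCII letters, digits, +, / and =.
        payload = base64.b64encode(value).decode("ascii")
        return f"<value><base64>{payload}</base64></value>"
    raise TypeError(f"unsupported type: {type(value).__name__}")


samples = [
    42,
    3.14,
    datetime(2009, 1, 2, 3, 4, 5),
    "日本語".encode("shift_jis"),  # raw non-ASCII bytes, wrapped in base64
]
for sample in samples:
    element = scalar_to_xmlrpc(sample)
    print(element, element.isascii())
```

Even the last sample, which starts out as Shift_JIS bytes, ends up as ASCII once it is base64-encoded, so only the string values need any encoding-specific care.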
So for those reading this who are more adept at working with encodings than I am: how should I approach this? Is the magic sauce somewhere in Perl’s Encode module? I really want to get this part of the RPC::XML module working right, so I can move on to the next big hassle, data compression…
(I also need to figure out why my code-highlighting plugin isn’t doing its job…)
(I think I got it now. Something was missing from one of the theme files. Gotta love WordPress/PHP…)