Idle Thoughts on Parsing XML (slightly Perlish)

October 7th, 2009 Posted in Perl, XML

(Side note: There was no Module Monday post this week, as I was too swamped to look for one to cover. Check back next week…)

I’m in the (achingly slow) process of writing a new XML-RPC parser using XML::LibXML. Because the library’s SAX support is spotty (according to its own docs), I’m letting it parse the whole message into a DOM object and then using that object to get the request or response. This has proven to be a serious pain in the lower regions.

The XML::Parser approach I’ve had since RPC::XML’s inception is an event-based parser: I use a state-machine/stack approach and push/pop items as needed, based on whether my event is a tag-start, tag-end, text, etc. As a side effect, I validate the document, since the stack/state machine will throw an exception if some event doesn’t fit what it is expecting.
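For readers who haven’t seen the stack/state-machine idea in action, here is a minimal sketch. It is in Python rather than Perl, using the standard library’s expat binding (expat is the same underlying parser that XML::Parser wraps). It is not the RPC::XML code: the ALLOWED table and the parse_response name are made up for illustration, and the grammar is a deliberately tiny subset of XML-RPC (a response carrying int or string values only).

```python
import xml.parsers.expat

# Hypothetical, simplified grammar: which child tags each parent may contain.
# None stands for "no element open yet" (the document root).
ALLOWED = {
    None: {"methodResponse"},
    "methodResponse": {"params"},
    "params": {"param"},
    "param": {"value"},
    "value": {"int", "string"},
    "int": set(),
    "string": set(),
}

def parse_response(xml_text):
    stack = []    # currently open tags, innermost last
    text = []     # character data collected for the current scalar
    values = []   # decoded values, in document order

    def start(name, attrs):
        parent = stack[-1] if stack else None
        # Validation falls out of the state check: an unexpected tag raises.
        if name not in ALLOWED.get(parent, set()):
            raise ValueError(f"unexpected <{name}> inside {parent!r}")
        stack.append(name)
        text.clear()

    def end(name):
        stack.pop()
        if name == "int":
            values.append(int("".join(text)))
        elif name == "string":
            values.append("".join(text))
        text.clear()

    def chars(data):
        if stack and stack[-1] in ("int", "string"):
            text.append(data)
        elif data.strip():
            # Non-whitespace text anywhere else is an error in this subset.
            raise ValueError(f"unexpected text {data!r}")

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return values
```

Note that there is no separate validation pass anywhere: a tag that shows up in the wrong place simply fails the lookup in start() and the parse dies with an exception.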

Taking a DOM approach means more work: not only am I drilling down for the data I need, I also have to check for validity along the way. (Some might point out that XML::LibXML supports checking document validity against any of a DTD, XML Schema or RelaxNG schema… I’m actually familiar with that. But there is no “real” (i.e., “official”) DTD or schema for XML-RPC for me to use in this case.)
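To give a feel for the extra work, here is a rough sketch of the drill-down-and-check style. It uses Python’s standard xml.dom.minidom as a stand-in for XML::LibXML (the DOM concepts are the same, the API names are not), and it assumes a deliberately simplified response shape: one param holding one int or string value. The helper names (element_children, only_child, response_value) are invented for this example.

```python
from xml.dom import Node, minidom

def element_children(node):
    # Skip whitespace text nodes, comments, etc.; keep element children only.
    return [c for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]

def only_child(node, expected):
    # Every level of the drill-down needs its own hand-written check.
    kids = element_children(node)
    if len(kids) != 1 or kids[0].tagName != expected:
        raise ValueError(
            f"<{node.tagName}> must contain exactly one <{expected}>")
    return kids[0]

def response_value(xml_text):
    root = minidom.parseString(xml_text).documentElement
    if root.tagName != "methodResponse":
        raise ValueError(f"unexpected root element <{root.tagName}>")

    params = only_child(root, "params")
    param = only_child(params, "param")
    value = only_child(param, "value")

    kids = element_children(value)
    if len(kids) != 1 or kids[0].tagName not in ("int", "string"):
        raise ValueError("<value> must contain exactly one <int> or <string>")

    scalar = kids[0]
    text = "".join(c.data for c in scalar.childNodes
                   if c.nodeType == Node.TEXT_NODE)
    return int(text) if scalar.tagName == "int" else text
```

Every step down the tree carries its own explicit structure check, and this covers only the happy path for two scalar types. Structs, arrays and faults would each need their own drill-down and their own checks.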

So here’s my observation, which is probably blindingly obvious to everyone else who’s worked with XML: SAX/event-based parsing is the way to go for processing a whole document, and DOM is better for cherry-picking pieces from different parts of it.

Like I said, probably pretty obvious to the rest of you, but it’s hitting me over the head pretty hard these days.

