I was messing around with the Apache Xerces-based XML DOMParser class (from the com.sun.org.apache.xerces.internal.impl.xs.dom package) for the JTwitt project and I noticed some quirky behavior. I used the following snippet of code:
DOMParser parser = new DOMParser();
parser.parse(new InputSource(xmlStream));
Document d = parser.getDocument();
Pretty straightforward stuff. In fact, you can probably find the same few lines in just about every DOMParser tutorial out there. The xmlStream is an InputStream instance holding the XML data. Where do I get it from? I pull it off Twitter as I described here. I tested this code before and got the XML to print out in the console, so my InputStream is not the issue here. Every time I called the parse method I got a few dozen errors like this:
s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than ‘xs:appinfo’ and ‘xs:documentation’
It was basically one error for each node which had text data as a child. I was googling this message for hours and it seems that no one has a clue what causes it. I’m definitely not the first person to run into it, but I have yet to see a working solution.
In the end I decided to abandon DOMParser. There are about a bazillion different ways to parse XML files in Java, so I simply switched to the JAXP parser (javax.xml.parsers). Now my code looks like this:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document d = builder.parse(xmlStream);
Both snippets are essentially equivalent and are meant to achieve the same thing: turn the stream into a DOM Document. The difference is that the second one actually works, so as far as I’m concerned DocumentBuilder > DOMParser. Still, if anyone has a clue what that s4s-elt-character error is all about, please leave a note in the comments so that future generations do not have to suffer because of it.
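For anyone who wants to try the JAXP route in isolation, here is a minimal, self-contained sketch. The class name JaxpParseSketch, the parse helper and the sample XML are made up for illustration; a real Twitter response will look different, and in JTwitt the stream comes from the network rather than from a string.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class JaxpParseSketch {

    // Hypothetical helper: parses the stream into a DOM tree using
    // whichever parser implementation JAXP picks by default.
    static Document parse(InputStream xmlStream) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(xmlStream);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data standing in for the XML pulled off Twitter.
        String xml = "<statuses><status><text>hello world</text></status></statuses>";
        InputStream xmlStream =
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));

        Document d = parse(xmlStream);

        // Walk every <text> element and print the text data it contains.
        NodeList texts = d.getElementsByTagName("text");
        for (int i = 0; i < texts.getLength(); i++) {
            System.out.println(texts.item(i).getTextContent());
        }
    }
}
```

The nodes with text data as a child are exactly the ones that triggered the errors with DOMParser, so reading them back is a quick sanity check that the switch did the trick.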