[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Dieter Maurer dieter at handshake.de
Wed Jan 17 16:32:18 EST 2007


Martijn Faassen wrote at 2007-1-16 23:19 +0100:
>Dieter Maurer wrote:
>> Martijn Faassen wrote at 2007-1-15 15:44 +0100:
>>> ....
>>> I would say refusing to guess and bailing out with an error message is
>>> better in this case.
>> 
>> I disagree with you.
>> 
>>   Logically, parsing an encoded XML document consists of two
>>   passes: decode the encoded string into unicode and reconstruct
>>   the XML info elements from the serialization.
>> 
>>   Traditionally, these two passes are not performed one after
>>   the other but folded together in a single pass.
>>   
>>   But that tradition should not prevent separating out the
>>   (Unicode) decoding phase. And after this phase is done,
>>   there is no ambiguity left with the "XML declaration".
>>   Its encoding attribute is simply irrelevant for the second phase
>>   (apart from generating the PI info element).
>
>That's nice as far as it goes. What if after the second phase you need 
>to parse the XML again?
>What do you do with your encoding header then? 

After the second phase, I no longer have an XML string but
instead either a sequence of events (SAX style) or a tree of
XML info elements (syntax tree style).

But, whatever I have, the second stage does not magically change
my unicode string. It could be parsed over and over again.
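
A minimal sketch of the two phases, assuming Python 3 and the standard
library's ElementTree (not zope.tal's parser). The sample document and
the helper names decode_phase and parse_phase are made up for
illustration. As far as I know, ElementTree ignores the declared
encoding when it is handed an already decoded string, which is the
behaviour argued for here:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: bytes as they would arrive, declared and encoded
# as ISO-8859-1.
RAW = '<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\u00e9</doc>'.encode(
    "iso-8859-1"
)


def decode_phase(raw, encoding):
    """Phase 1: turn the encoded byte string into unicode."""
    return raw.decode(encoding)


def parse_phase(text):
    """Phase 2: rebuild the XML info elements from the unicode string."""
    return ET.fromstring(text)


text = decode_phase(RAW, "iso-8859-1")
first = parse_phase(text)
second = parse_phase(text)  # parsing does not consume or alter the string
```

The declaration is still present in "text" after both parses, so the
same unicode string can be handed to the second phase as often as needed.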

>If it's irrelevant, you better strip it out before you put it into the 
>parser.

I lose information then. The event stream or info element tree
lacks the XML declaration PI then, or at least its "encoding" attribute.

The parsing process is allowed to lose some information.
For example it can lose whitespace details or the order
of attributes. I don't know whether the loss or modification
of "PI"s is considered acceptable. In general, this would
definitely be wrong.

I have read an article in "comp.text.xml" that complained
about the loss of the encoding information, as it may be a good hint
about the default encoding to be used on encoding/serialization.
This means that some XML processing systems lose the information
and not everyone is happy.
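
To illustrate the kind of loss meant here, a small sketch using only
the Python 3 standard library (again not zope.tal). The sample document
and the helper name declared_encoding are made up for illustration.
The expat event stream can report the declared encoding through its
XmlDeclHandler callback, while an ElementTree tree does not appear to
keep it, so it is absent after reserialization:

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

# Hypothetical sample document, declared and encoded as ISO-8859-1.
RAW = '<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\u00e9</doc>'.encode(
    "iso-8859-1"
)


def declared_encoding(raw):
    """Return the encoding named in the XML declaration, or None."""
    seen = {}

    def on_xml_decl(version, encoding, standalone):
        # Event-style parsing: the declaration arrives as its own event.
        seen["encoding"] = encoding

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(raw, True)
    return seen.get("encoding")


# Tree-style parsing: the element content survives, the declaration does not.
root = ET.fromstring(RAW)
reserialized = ET.tostring(root, encoding="unicode")
```

A serializer working from "root" alone has no hint that the document
originally came in as ISO-8859-1; one working from the event stream does.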



-- 
Dieter

