[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Mon Jan 15 16:15:46 EST 2007

Andreas Jung wrote:

> --On 15 January 2007 15:44:01 +0100 Martijn Faassen 
> <faassen at startifact.com> wrote:
>> On 1/15/07, Andreas Jung <lists at zopyx.com> wrote:
>> [snip]
>>> ok, got it. But this problem can be solved easily by changing the
>>> encoding within the preamble.
>>
>> I would say refusing to guess and bailing out with an error message is
>> better in this case. The Zen of Python:
>>
>> In the face of ambiguity, refuse the temptation to guess.
>>
> 
> Sorry but I don't get your point. What's happening with an XML document inside a ZPT?

My point is that:

u'<?xml version="1.0" encoding="ISO-8859-1"?><foo>Some non-ascii text</foo>'

is confusing at best. One part of this says it's a unicode string, the 
other part says it's in encoding latin-1. What is it? What happens to 
this if you recode this to, say, UTF-8? What happens to this if you 
parse and *then* serialize it? What does the developer expect will 
happen? What do users expect when they enter XML in a form and include 
an encoding declaration?
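
To make the recoding question concrete, here is a minimal sketch. It is not code from zope.tal: the function name is made up for illustration, and it uses the standard library's ElementTree rather than the ZPT parser. It recodes such a string to UTF-8 while leaving the declaration alone, then hands the bytes to a parser that trusts the declaration:

```python
import xml.etree.ElementTree as ET


def parse_after_recoding(text):
    """Recode a unicode XML string to UTF-8 bytes and parse the bytes.

    Hypothetical helper for illustration only. The encoding declaration
    inside `text` is left untouched, so it may no longer match the bytes.
    """
    data = text.encode("utf-8")   # recode; the declaration still says ISO-8859-1
    root = ET.fromstring(data)    # byte input: the parser follows the declaration
    return root.text


doc = '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'
print(repr(parse_after_recoding(doc)))
```

Each of the two UTF-8 bytes of the accented character is read as a separate Latin-1 character, so the text comes back mangled instead of raising an error.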

I proposed that we spare everyone these worries by simply not accepting such input.
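
A rough sketch of what "not accepting this" could look like, assuming a plain check done before parsing. The helper name and the regular expression are illustrative only and not part of zope.tal; the pattern is deliberately simple and is not a full XML tokenizer:

```python
import re

# Matches an XML declaration at the very start of the text that carries an
# encoding pseudo-attribute, and captures the declared encoding name.
_DECLARED_ENCODING = re.compile(
    r'^<\?xml[^>]*?\bencoding\s*=\s*["\']([^"\']+)["\']'
)


def reject_declared_encoding(text):
    """Return `text` unchanged, or raise if it declares an encoding.

    Hypothetical helper: a unicode string has no byte encoding, so a
    declaration inside it is treated as ambiguous and refused.
    """
    if not isinstance(text, str):
        raise TypeError("expected a unicode string")
    match = _DECLARED_ENCODING.match(text)
    if match:
        raise ValueError(
            "unicode input must not carry an encoding declaration "
            "(found %r)" % match.group(1)
        )
    return text
```

Unicode input without a declaration, or with a declaration that only gives the version, passes through untouched; anything that names an encoding fails early with a clear error.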

> - XML data encoded as XXX comes in (either by editing the XML file through
>   the ZMI or FTP/WebDAV upload)
> 
> - ZPT converts the encoded string to unicode based on the encoding in 
> the preamble
> 
> - for parsing it is up to the application to decide what to do with the 
> data. It is not up to the editor to decide how the ZPT engine should 
> deal with XML internally. The ZPT engine decides to serialize the 
> unicode string as utf-8 and to fix the XML preamble (which will result 
> in a valid XML file that should be identical to the original file, 
> except that the encoding might be different).

> I still don't see what should be ambiguous with this approach.

Ambiguous in that the string seems to say it's in two encodings at once. 
You're then "guessing": you're letting the Python string type trump the 
declaration. Then, since we've shown that leads to bugs, you propose 
actually changing the encoding declaration of the XML document. I wonder 
what people then expect to happen upon serialization. In effect, your 
proposal would, I think, serialize to UTF-8 only, right? (in which case 
the encoding declaration can be dropped as it's the default)
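
For reference, the proposal as I understand it could be sketched roughly like this. The function names and regular expressions are my own illustration, not the ZPT implementation, and the declaration handling is simplified (no BOM or UTF-16 detection):

```python
import re

# Encoding pseudo-attribute in a declaration at the start of byte input.
_BYTES_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
# Any XML declaration at the start of unicode text.
_TEXT_DECL = re.compile(r'^<\?xml[^>]*\?>')


def decode_xml(data):
    """Decode uploaded bytes using the declared encoding (default UTF-8)."""
    match = _BYTES_DECL.match(data)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    return data.decode(encoding)


def serialize_utf8(text):
    """Drop any old declaration and emit UTF-8 bytes with a matching one."""
    body = _TEXT_DECL.sub("", text, count=1)
    return b'<?xml version="1.0" encoding="UTF-8"?>' + body.encode("utf-8")


upload = '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'.encode("iso-8859-1")
text = decode_xml(upload)
print(serialize_utf8(text))
```

Whatever encoding came in, UTF-8 goes out, and the original declaration survives only as long as the text is held internally.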

Regards,

Martijn