[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Martijn Faassen faassen at startifact.com
Mon Jan 15 07:26:16 EST 2007


Andreas Jung wrote:
[snip]

[Bernd Dorn]
>> IMHO it should only accept strings, because in the value should be a xml
>> string and therefore always has to be encoded in 'utf-8' or in the
>> encoding specified in the processing instruction.
>>
> 
> I disagree with that. Since Zope 3 is supposed to use unicode internally
> (at least that's the legend) it should support unicode also at the 
> parser level. Other languages like Java store XML also as unicode 
> strings and support parsing it.

Bernd Dorn raises a good point though, and it's one you need to think 
about carefully. To say "languages like Java store XML also as unicode" 
is rather ambiguous. While I'm not aware of the details of Java, 
serialized XML is typically stored in some encoded form, most commonly 
UTF-8 (the default when no encoding is declared), but Latin-1 is also 
supported, and there are other multi-byte encodings as well. *Parsed* 
XML exposed through a DOM is exposed as unicode strings. I'm sure Java 
supports this usage pattern, as naturally files on disk need to be 
parsable.

Here you are talking about parsing XML, so maintaining the position that 
this should be encoded is a reasonable one. This is how for instance the 
Python ElementTree operates (parse encoded, expose API as unicode (or 
pure ascii)), and this has been designed by Fredrik Lundh, who, as you 
may know, was instrumental in developing Python's unicode support.
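
To illustrate the "parse encoded, expose unicode" pattern, here is a 
minimal sketch using the standard library's xml.etree.ElementTree. It 
assumes Python 3, where str is the unicode type; the sample document and 
variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Serialized XML: bytes in the encoding that the declaration names.
encoded = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'
).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(encoded)

# The parsed API hands back unicode text, whatever the input encoding was.
text = root.text
```

The encoding only matters at the boundary: once parsed, the caller never 
sees ISO-8859-1 bytes.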

How would you propose to parse the following unicode string?

u'<?xml version="1.0" encoding="ISO-8859-1"?><foo />'

If you are going to allow the parsing of unicode strings, I would 
strongly recommend *rejecting* any unicode string that itself declares 
an encoding as ambiguous: refuse to guess.

With lxml (which is an extension of the ElementTree API) we've taken 
this approach: it's possible to pass a unicode string into the parser, 
but if that string contains an encoding declaration, there will be an 
error. Underneath we actually re-encode the string to UTF-8, as that's 
what the libxml2 parser expects. We made this change over the objections 
of Fredrik Lundh, by the way - we felt user errors would mostly be 
prevented because the parser refuses to guess.
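
A rough sketch of that policy, written against the standard library 
rather than lxml itself, might look like the following. The helper name 
parse_xml and the regular expression are hypothetical, this is not 
lxml's or zope.tal's actual code, and it assumes Python 3 (str is 
unicode, bytes is encoded data).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: an XML declaration that carries an encoding attribute.
_ENCODING_DECL = re.compile(
    r'^\s*<\?xml[^>]*?\bencoding\s*=\s*["\'][^"\']*["\']'
)


def parse_xml(source):
    """Parse bytes as-is; accept unicode only if it declares no encoding."""
    if isinstance(source, str):
        if _ENCODING_DECL.match(source):
            # The text is already decoded, so the declaration cannot be
            # trusted: refuse to guess.
            raise ValueError(
                "unicode string with an encoding declaration is ambiguous; "
                "pass bytes instead"
            )
        # No declaration: re-encode to UTF-8, which the parser assumes
        # by default.
        source = source.encode("utf-8")
    return ET.fromstring(source)
```

With this helper, parse_xml('<foo />') and parse_xml(b'<?xml version="1.0" 
encoding="ISO-8859-1"?><foo />') both succeed, while the unicode string 
from the question above raises ValueError.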

Regards,

Martijn


