[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Martijn Faassen faassen at startifact.com
Mon Jan 15 16:08:58 EST 2007


Tres Seaver wrote:
> Andreas Jung wrote:
>> --On 14. Januar 2007 18:14:45 +0000 Chris Withers <chris at simplistix.co.uk> 
>> wrote:
>>
>>> Dieter Maurer wrote:
>>>> A halfway intelligent parser would accept Unicode when it gets it
>>>> and concentrate on the remaining part of its task: either reporting
>>>> structural events or building a parse tree.
>>> The trivial fix I use in Twiddler is as follows:
>>>
>>> if isinstance(source,unicode):
>>>    source = source.encode('utf-8')
>>>
>>> Of course, this assumes a heading of either <?xml version="1.0"
>>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml
>>> spec states that the string must be utf-8 encoded.
>> The encoding of the XML preamble should not matter when parsing an XML
>> document stored as a unicode string.
> 
> That encoding is a *lie*, which is the real problem.  Parsers expect it
> to be *correct*, and if missing, expect the text to be encoded as UTF-8,
> per the spec (if the document comes from an HTTP request, then the
> application may supply the encoding from the request headers).
> 
> Nothing in the XML specs allows or specifies any behavior for XML
> documents serialized as unicode, because such serializations are
> *programming language specific*.

While I agree that the encoding declaration is ambiguous at best and 
should be rejected, you can find a bit in the spec which supports XML as 
Python unicode strings. A Python unicode string can be seen as a string 
with "external character encoding information": it's the native encoding 
of Python. Therefore we can make sense of it in an XML parser. For my 
previous analysis of the spec see here:

http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html

What is bad and evil, however, is to silently ignore a conflicting 
encoding declaration in the XML document itself. I'd choose one of:

* bail with a clear error when unicode is supplied at all

* bail with a clear error when unicode is supplied with any explicit 
encoding declaration in the XML.
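
The second option could be sketched roughly as follows. This is only an
illustration, not what zope.tal or lxml actually do: the helper name
prepare_xml_source and the regular expression are made up for the example,
and it is written for Python 3, where str is the unicode type. Byte
strings pass through untouched, unicode input with an explicit encoding
declaration is rejected, and unicode input without one is encoded as
UTF-8, which is what a parser assumes when no declaration is present.

```python
import re
import xml.etree.ElementTree as ET

# Matches an encoding pseudo-attribute inside a leading XML declaration.
# (Hypothetical pattern for this sketch; it does not handle a leading BOM.)
_ENCODING_DECL = re.compile(
    r'^<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def prepare_xml_source(source):
    """Return bytes suitable for a byte-oriented XML parser.

    Hypothetical helper: rejects unicode input that carries its own
    encoding declaration instead of silently ignoring the declaration.
    """
    if isinstance(source, bytes):
        # Already encoded: the parser trusts the declaration or assumes UTF-8.
        return source
    match = _ENCODING_DECL.match(source)
    if match:
        raise ValueError(
            "unicode XML input must not carry an encoding declaration "
            "(found encoding=%r)" % match.group(1)
        )
    # No declaration: UTF-8 is the default a parser will assume.
    return source.encode("utf-8")


# Unicode without a declaration is accepted and parsed as UTF-8 bytes.
root = ET.fromstring(prepare_xml_source('<?xml version="1.0"?><doc>caf\u00e9</doc>'))

# Unicode with a declaration is refused with a clear error.
try:
    prepare_xml_source('<?xml version="1.0" encoding="iso-8859-1"?><doc/>')
except ValueError as exc:
    rejection_message = str(exc)
```

The first option is the same function with the ValueError raised for any
str input, regardless of what the declaration says.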

>> It is of importance as soon as you 
>> convert the document back to a stream, e.g. when we deliver the content
>> back to a browser or an FTP client. The ZPublisher (for Zope 2) deals with 
>> that by changing the encoding parameter of the preamble for XML documents 
>> based on the desired output encoding. utf-8 is always a good choice; however,
>> other encodings like iso-8859-15 might raise UnicodeDecodeErrors. The Zope 2
>> publisher "avoids" this problem by converting the unicode result using 
>> errors='replace' (which is likely something we might discuss :-))
> 
> Unicode XML is not only problematic for streaming. For instance, you
> *can't* pass a Unicode string to libxml2 *at all*, unless you want
> a core dump.  The API requires that you pass it strings encoded as UTF-8.

You can in lxml. :) libxml2 as a C API doesn't even support any unicode 
string type as far as I am aware.

Regards,

Martijn


