[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Tres Seaver tseaver at palladion.com
Mon Jan 15 14:55:11 EST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andreas Jung wrote:
> 
> --On 14. Januar 2007 18:14:45 +0000 Chris Withers <chris at simplistix.co.uk> 
> wrote:
> 
>> Dieter Maurer wrote:
>>> A halfway intelligent parser would accept Unicode when it gets it
>>> and concentrate on the remaining part of its task: either reporting
>>> structural events or building a parse tree.
>> The trivial fix I use in Twiddler is as follows:
>>
>> if isinstance(source,unicode):
>>    source = source.encode('utf-8')
>>
>> Of course, this assumes a heading of either <?xml version="1.0"
>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml
>> spec states that the string must be utf-8 encoded.
> 
> The encoding of the XML preamble should not matter when parsing an XML
> document stored as a unicode string.

That encoding is a *lie*, which is the real problem.  Parsers expect it
to be *correct*, and if missing, expect the text to be encoded as UTF-8,
per the spec (if the document comes from an HTTP request, then the
application may supply the encoding from the request headers).

Nothing in the XML specs allows or specifies any behavior for XML
documents serialized as unicode, because such serializations are
*programming language specific*.
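
To make the "lie" concrete, here is a minimal sketch. It assumes Python 3
(where text is str rather than unicode) and the standard library's
xml.etree.ElementTree rather than zope.tal; the variable names are made up
for illustration. The same text is serialized twice, once in a way that
contradicts its declaration and once in a way that matches it:

```python
import xml.etree.ElementTree as ET

# Text document whose declaration claims iso-8859-1 (illustrative input).
text = '<?xml version="1.0" encoding="iso-8859-1"?><p>\u00e9</p>'

# Serializing as UTF-8 leaves the declaration claiming iso-8859-1: a lie.
lying_bytes = text.encode('utf-8')

# Serializing in the declared encoding keeps the declaration truthful.
honest_bytes = text.encode('iso-8859-1')

# The parser believes the declaration, so the UTF-8 bytes for U+00E9
# (0xC3 0xA9) are read as two separate Latin-1 characters.
mangled = ET.fromstring(lying_bytes).text
correct = ET.fromstring(honest_bytes).text

print(repr(mangled))
print(repr(correct))
```

The parser raises no error for the first case; it simply returns the wrong
characters, which is why a mismatched declaration is hard to notice.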

> It is of importance as soon as you 
> convert the document back to a stream e.g. when we deliver the content
> back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with 
> that by changing the encoding parameter of the preamble for XML documents 
> based on the desired output encoding. utf-8 is always a good choice; however,
> other encodings like iso-8859-15 might raise UnicodeDecodeErrors. The Zope 2
> publisher "avoids" this problem by converting the unicode result using 
> errors='replace' (which is likely something we might discuss :-))
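
As a rough illustration of what errors='replace' does on output, here is a
sketch in plain Python 3. It is not ZPublisher code, and the sample text is
invented. It also shows the standard 'xmlcharrefreplace' handler as one
possible alternative for XML output, purely as an assumption about what
might be preferable:

```python
# Sample text: the euro sign exists in iso-8859-15, the snowman does not.
text = 'price: 10\u20ac, snowman: \u2603'

# Strict encoding fails on the character the target charset lacks.
try:
    text.encode('iso-8859-15')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors='replace' silently substitutes '?' and loses information.
lossy = text.encode('iso-8859-15', errors='replace')

# errors='xmlcharrefreplace' emits a numeric character reference instead,
# which an XML or HTML consumer can still decode.
escaped = text.encode('iso-8859-15', errors='xmlcharrefreplace')

print(strict_failed)
print(lossy)
print(escaped)
```

With 'replace' the snowman is gone for good; with 'xmlcharrefreplace' it
survives as a character reference, at the cost of a longer byte stream.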

Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump.  The API requires that you pass it strings encoded as UTF-8.
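
One way to prepare text for a byte-oriented parser is to encode it as UTF-8
and rewrite the declaration to match. The following is only a sketch under
assumptions: it is Python 3, the helper name to_utf8_xml is hypothetical,
and it does not call libxml2 itself.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    r'''^(<\?xml[^>]*?encoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2'''
)


def to_utf8_xml(source):
    """Return UTF-8 bytes whose declaration agrees with the encoding.

    Hypothetical helper for illustration; not part of zope.tal or libxml2.
    """
    if isinstance(source, bytes):
        # Already serialized: leave it alone and trust its own declaration.
        return source
    # Make the declaration say utf-8, if it names an encoding at all.
    source = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', source, count=1)
    return source.encode('utf-8')


declared = '<?xml version="1.0" encoding="iso-8859-1"?><p>\u00e9</p>'
print(to_utf8_xml(declared))
print(to_utf8_xml('<p>plain</p>'))
```

A document with no encoding attribute is left as it is, since UTF-8 is then
what a parser will assume anyway.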


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFq9wf+gerLs4ltQ4RAvBkAKCGZke7HHr7vWQKcwn5IHW93GHlFQCgyXMJ
a+vZYi2VRnZTt1XBt7O6U3Y=
=+i3B
-----END PGP SIGNATURE-----


