[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Tres Seaver tseaver at palladion.com
Mon Jan 15 16:57:05 EST 2007



Martijn Faassen wrote:
> Tres Seaver wrote:
>> Andreas Jung wrote:
>>> --On 14 January 2007 18:14:45 +0000 Chris Withers <chris at simplistix.co.uk>
>>> wrote:
>>>
>>>> Dieter Maurer wrote:
>>>>> A halfway intelligent parser would accept Unicode when it gets it
>>>>> and concentrate on the remaining part of its task: either reporting
>>>>> structural events or building a parse tree.
>>>> The trivial fix I use in Twiddler is as follows:
>>>>
>>>> if isinstance(source,unicode):
>>>>    source = source.encode('utf-8')
>>>>
>>>> Of course, this assumes a heading of either <?xml version="1.0"
>>>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml
>>>> spec states that the string must be utf-8 encoded.
>>> The encoding of the XML preamble should not matter when parsing an XML
>>> document stored as a unicode string.
>> That encoding is a *lie*, which is the real problem.  Parsers expect it
>> to be *correct*, and if missing, expect the text to be encoded as UTF-8,
>> per the spec (if the document comes from an HTTP request, then the
>> application may supply the encoding from the request headers).
>>
>> Nothing in the XML specs allows or specifies any behavior for XML
>> documents serialized as unicode, because such serializations are
>> *programming language specific*.
> 
> While I agree that the encoding declaration is ambiguous at best and 
> should be rejected, you can find a bit in the spec which supports XML as 
> Python unicode strings. A Python unicode string can be seen as a string 
> with "external character encoding information": it's the native encoding 
> of Python. Therefore we can make sense of it in an XML parser. For my 
> previous analysis of the spec see here:
> 
> http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html
> 
> What however is bad and evil is to just ignore conflicting encoding 
> declarations in an XML document itself. I'd choose either one of:
> 
> * bail with a clear error when unicode is supplied at all
> 
> * bail with a clear error when unicode is supplied with any explicit 
> encoding declaration in the XML.
> 
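A minimal sketch of that second option, written for Python 3 (where `str`
is what Python 2 calls `unicode`). The helper name
`reject_declared_encoding` and the regular expression are illustrative
assumptions, not part of zope.tal or lxml:

```python
import re

# Hypothetical pattern: an XML declaration at the very start of the text
# that carries an explicit encoding pseudo-attribute.
_DECLARED_ENCODING = re.compile(
    r"""<\?xml[^>]*?\bencoding\s*=\s*["']([^"']+)["']"""
)


def reject_declared_encoding(source):
    """Return source unchanged, or raise if text declares an encoding.

    Byte strings pass through untouched: for them the declaration is
    meaningful and the parser should honour it.
    """
    if isinstance(source, str):
        match = _DECLARED_ENCODING.match(source)
        if match is not None:
            raise ValueError(
                "text input must not declare an encoding (found %r)"
                % match.group(1)
            )
    return source
```

Text without a declaration, or with a declaration that has no encoding,
goes through; only the ambiguous case fails, and it fails loudly.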
>>> It is of importance as soon as you 
>>> convert the document back to a stream e.g. when we deliver the content
>>> back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with 
>>> that by changing the encoding parameter of the preamble for XML documents 
>>> based on the desired output encoding. utf-8 is always a good choice; however,
>>> other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
>>> publisher "avoids" this problem converting the unicode result using 
>>> errors='replace' (which is likely something we might discuss :-))
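To make the trade-off concrete, here is a small sketch in Python 3 of what
happens when text contains a character the target encoding lacks. The
sample string is an assumption chosen for illustration; it is not taken
from ZPublisher:

```python
text = "caf\u00e9 \u4e2d"  # U+4E2D has no iso-8859-15 equivalent

# Strict encoding (the default) refuses to lose data.
try:
    text.encode("iso-8859-15")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" substitutes "?" for each unencodable character,
# so the output is silently lossy.
lossy = text.encode("iso-8859-15", errors="replace")

# utf-8 can represent every character, so it round-trips.
as_utf8 = text.encode("utf-8")
```

The lossy result cannot be turned back into the original text, which is
why the choice of errors='replace' deserves discussion.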
>> Unicode XML is not only problematic for streaming. For instance, you
>> *can't* pass a Unicode string to libxml2 *at all*, unless you want
>> a core dump.  The API requires that you pass it strings encoded as UTF-8.
> 
> You can in lxml. :) libxml2 as a C API doesn't even support any unicode 
> string type as far as I am aware.

It *requires* UTF-8-encoded strings.  See http://xmlsoft.org/xml.html

  12. So what is this funky "xmlChar" used all the time?

      It is a null terminated sequence of utf-8 characters. And only
      utf-8! You need to convert strings encoded in different ways to
      utf-8 before passing them to the API. This can be accomplished
      with the iconv library for instance.
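In Python terms, the conversion step looks roughly like the sketch below
(Python 3, where `str` is what Python 2 calls `unicode`). The helper name
`to_utf8` is hypothetical, and the standard library's ElementTree stands
in for libxml2 here purely so the example is self-contained. It assumes
the text carries no conflicting encoding declaration:

```python
import xml.etree.ElementTree as ET


def to_utf8(source):
    """Return source as UTF-8 bytes; byte strings pass through unchanged."""
    if isinstance(source, str):
        return source.encode("utf-8")
    return source


doc = "<p>caf\u00e9</p>"

# With no encoding declaration, the parser treats the bytes as UTF-8.
root = ET.fromstring(to_utf8(doc))
```

After parsing, `root.text` is the decoded text again; the encoded form
only exists at the boundary with the parser.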

Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding).  It isn't
"discourse", in the Zope3 sense of "text intended for human
consumption", and the tools people use with it are all going to expect
some kind of validly-encoded string.


Tres.
--
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com


