[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Tue Jan 16 06:27:25 EST 2007

Tres Seaver wrote:
[snip]
>>> Unicode XML is not only problematic for streaming. For instance, you
>>> *can't* pass a Unicode string to libxml2 *at all*, unless you want
>>> a core dump.  The API requires that you pass it strings encoded as UTF8.
>> You can in lxml. :) libxml2 as a C API doesn't even support any unicode 
>> string type as far as I am aware.
> 
> It *requires* UTF-8-encoded strings.  See http://xmlsoft.org/xml.html

>   12. So what is this funky "xmlChar" used all the time?
> 
>       It is a null terminated sequence of utf-8 characters. And only
>       utf-8! You need to convert strings encoded in different ways to
>       utf-8 before passing them to the API. This can be accomplished
>       with the iconv library for instance.

Um, Tres, no need to tell me about the libxml2 API.

There is also the libxml2 *python* API, which I believe has a knob to
turn on the ability to pass in unicode strings, though I haven't tried
that myself. Then of course there's lxml, a Python layer that
requires unicode or plain-ASCII strings in its DOM-ish (ElementTree)
API, and encoded data for the parser.

We should distinguish the behavior of libxml2 as a tree API (utf-8 all 
the way) and as a parser/serializer (all sorts of encodings). Generally 
XML libraries make a distinction between the two.
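
To make that distinction concrete, here is a rough sketch. It uses the
standard library's xml.etree.ElementTree as a stand-in for lxml (my
assumption, so that it runs without extra packages), Python 3 semantics
(str is unicode, bytes is encoded data), and a made-up sample document:

```python
import xml.etree.ElementTree as ET

# Parser side: input is *encoded* bytes, and the parser picks up the
# encoding from the XML declaration (iso-8859-1 in this made-up sample).
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")
root = ET.fromstring(latin1_doc)

# Tree side: text comes back as a unicode string, whatever the input
# encoding was.
text = root.text

# Serializer side: the output encoding is chosen when writing.
utf8_out = ET.tostring(root, encoding="utf-8")
latin1_out = ET.tostring(root, encoding="iso-8859-1")
```

The tree never exposes the original encoding; only the parser and the
serializer deal with it.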

> Frankly, I don't get the desire to *store* a complete XML document (as
> opposed to the extracted contents of attributes or nodes) as unicode:
> it isn't as though it can be easily processed in that form without
> re-encoding (even if lxml is the one doing the re-encoding).  It isn't
> "discourse", in the Zope3 sense of "text intended for human
> consumption", and the tools people use with it are all going to expect
> some kind of validly-encoded string.

There are objects that allow you to edit XML; the ZPT page is an 
example. I do not know whether it stores as unicode right now, but you 
can argue it's text intended for human consumption, as humans are 
supposed to be editing it. :)

From an efficiency point of view, however, it may indeed make more
sense to store this information as UTF-8. That would probably still
require decoding the data into unicode for the purposes of inspecting
and editing it.
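
Roughly, that round trip would look like the sketch below. This is only
an illustration in Python 3 terms; the variable names and the sample
markup are hypothetical and not taken from the ZPT implementation:

```python
# What would sit in storage: the document as UTF-8 encoded bytes.
stored = "<p>caf\u00e9</p>".encode("utf-8")

# Decode to unicode for inspecting and editing.
editable = stored.decode("utf-8")

# A stand-in for whatever change the human editor makes.
edited = editable.replace("caf\u00e9", "th\u00e9")

# Encode again before writing the result back.
stored = edited.encode("utf-8")
```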

Regards,

Martijn