[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Martijn Faassen faassen at startifact.com
Tue Jan 16 06:27:25 EST 2007


Tres Seaver wrote:
[snip]
>>> Unicode XML is not only problematic for streaming. For instance, you
>>> *can't* pass a Unicode string to libxml2 *at all*, unless you want
>>> a core dump.  The API requires that you pass it strings encoded as UTF8.
>> You can in lxml. :) libxml2 as a C API doesn't even support any unicode 
>> string type as far as I am aware.
> 
> It *requires* UTF-8-encoded strings.  See http://xmlsoft.org/xml.html
> 
>   12. So what is this funky "xmlChar" used all the time?
> 
>       It is a null terminated sequence of utf-8 characters. And only
>       utf-8! You need to convert strings encoded in different ways to
>       utf-8 before passing them to the API. This can be accomplished
>       with the iconv library for instance.

Um, Tres, no need to tell me about the libxml2 API.

There is also the libxml2 *Python* API, which I believe has a knob to 
turn on the ability to pass in unicode strings, though I haven't tried 
that myself. Then of course there's lxml, a Python layer that requires 
unicode or plain-ASCII strings in its DOM-ish (ElementTree) API, and 
encoded data for the parser.

We should distinguish the behavior of libxml2 as a tree API (utf-8 all 
the way) and as a parser/serializer (all sorts of encodings). Generally 
XML libraries make a distinction between the two.
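
To make the tree vs. parser/serializer distinction concrete, here is a 
minimal sketch. It assumes Python 3 and uses the standard library's 
xml.etree.ElementTree as a stand-in; that is not libxml2 or lxml, and the 
sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Parser side: input is encoded bytes, and the document declares its encoding.
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")
root = ET.fromstring(latin1_doc)

# Tree side: text comes back as a Unicode str, whatever the source encoding was.
text = root.text

# Serializer side: the caller picks the output encoding.
utf8_out = ET.tostring(root, encoding="utf-8")       # bytes
unicode_out = ET.tostring(root, encoding="unicode")  # str, no declaration
```

The encoding only matters at the edges (parsing and serializing); code that 
walks the tree never sees it.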

> Frankly, I don't get the desire to *store* a complete XML document (as
> opposed to the extracted contents of attributes or nodes) as unicode:
> it isn't as though it can be easily processed in that form without
> re-encoding (even if lxml is the one doing the re-encoding).  It isn't
> "discourse", in the Zope3 sense of "text intended for human
> consumption", and the tools people use with it are all going to expect
> some kind of validly-encoded string.

There are objects that allow you to edit XML; the ZPT page is an 
example. I do not know whether it stores as unicode right now, but you 
can argue it's text intended for human consumption, as humans are 
supposed to be editing it. :)

From an efficiency point of view, however, it may indeed make more sense 
to store this information as UTF-8. This would probably still require 
decoding the data into unicode for the purposes of inspecting and 
editing it.
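
A rough sketch of that trade-off, assuming Python 3. The StoredTemplate 
class and its method names are hypothetical and are not how ZPT pages are 
actually implemented:

```python
class StoredTemplate:
    """Hypothetical container that keeps its XML source as UTF-8 bytes."""

    def __init__(self, source: str) -> None:
        # Encode once on the way in; storage holds bytes only.
        self._data = source.encode("utf-8")

    def raw(self) -> bytes:
        """Return the stored bytes, ready to hand to a parser."""
        return self._data

    def read_for_editing(self) -> str:
        """Decode to Unicode text for inspection or editing."""
        return self._data.decode("utf-8")

    def write(self, source: str) -> None:
        """Re-encode edited text back to UTF-8 for storage."""
        self._data = source.encode("utf-8")


page = StoredTemplate("<p>caf\u00e9</p>")
edited = page.read_for_editing().replace("caf\u00e9", "th\u00e9")
page.write(edited)
```

Every edit costs one decode and one encode, while parsing can use the 
stored bytes directly.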

Regards,

Martijn


