[Zope3-dev] Unicode handling in Zope3 Page Templates

Tue Aug 31 05:08:31 EDT 2004

Jim Fulton wrote:
> Sidnei da Silva wrote:
>> So, I've created a couple pieces of content, and everything worked
>> fine on the Zope 2 side. I can see the titles with the correct chars
>> both in the ZMI and in Plone.
>> However, if I try to view the same object in a Zope 3 ZPT, I get a
>> unicode error (UnicodeDecodeError I think).
>>
>> If I use sys.setdefaultencoding('utf-8') in sitecustomize.py,
>  
> I consider this evil.  I know Guido would be happy if this wasn't in
> the language.  This is just not an option.

I think Sidnei shouldn't even have brought this up; it should have 
absolutely nothing to do with the rest of the discussion and what Sidnei 
actually did in the end. Sidnei, please please get rid of this 
sys.setdefaultencoding() thing and forget it ever existed -- it's 
absolutely impossible to support code that even faintly looks like it 
may depend on it.

This code isn't depending on setdefaultencoding() though, as far as I 
can see.

>> I've made some changes to tales, tal and pagetemplate and added a test
>> to confirm that if only encoded strings are returned it doesn't
>> break. And if only unicode strings are returned it also doesn't break.
>>
>> However, this changes the current state, in which ZPT always returns
>> unicode to a state where if you get at least a method or attribute
>> returning unicode, you will get unicode output, whereas if you don't
>> have any unicode involved, you will get a string.
>>
>> Any chances this can be integrated into Zope 3,
> 
> Probably not. Certainly not in its current form.

What is wrong with the current form, Jim? What it tries to do is make 
the page template engine 'unicode agnostic', instead of 'unicode only'. 
Right now, the page template engine in Zope 3 (unlike the one in Zope 
2) works only if you feed it unicode strings. This is fine in Zope 3, 
but if you want to use it outside of Zope 3, this may not be what you 
want.

So, Sidnei attempted to change it so that it'll work if you put in 
normal (encoded) strings only, or if you put in unicode strings only. 
Combinations will still fail miserably (this hasn't changed). The 
failure will even happen in the same place -- getvalue() in StringIO at 
the end. The only type of thing that can be combined safely with both is 
plain ascii strings; i.e. it relies on the unchanged default encoding of 
the system.
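
To make the three cases concrete, here is a rough sketch. It is written 
in Python 3 spelling so it can be run today: bytes stands in for an 
encoded string and str for a unicode string, and the implicit ascii 
coercion that Python 2 performs is emulated by hand. join_like_py2 is a 
made-up helper for this illustration, not anything in ZPT or StringIO.

```python
def join_like_py2(parts):
    """Join fragments roughly the way Python 2 would.

    Assumption for this sketch: if any fragment is unicode (str here),
    every encoded fragment (bytes here) is implicitly decoded with the
    default 'ascii' codec before joining.
    """
    if not any(isinstance(p, str) for p in parts):
        return b"".join(parts)  # encoded-only input stays encoded
    return "".join(
        p.decode("ascii") if isinstance(p, bytes) else p
        for p in parts
    )


utf8_title = "caf\u00e9".encode("utf-8")  # encoded, non-ascii content

# Encoded strings only: works, result is encoded.
encoded_only = join_like_py2([b"<h1>", utf8_title, b"</h1>"])

# Unicode strings only: works, result is unicode.
unicode_only = join_like_py2(["<h1>", "caf\u00e9", "</h1>"])

# Plain ascii encoded strings can be combined with unicode safely.
ascii_mix = join_like_py2(["<h1>", b"plain", "</h1>"])

# Non-ascii encoded strings combined with unicode fail at join time.
try:
    join_like_py2(["<h1>", utf8_title, "</h1>"])
    mixed_failed = False
except UnicodeDecodeError:
    mixed_failed = True
```

The last case is the one that blows up at the very end of rendering, 
which is why the error shows up far away from the value that caused it.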

I can see something wrong with the following hack Sidnei employed in a 
few places, where he replaced

unicode(text)

with

isinstance(text, basestring) and text or unicode(text)

Plain unicode(text), like str(text), doesn't typically work in 
unicode-agnostic code.

Trying to reconstruct the logic in more readable form (which is 
difficult, indicating that this code shouldn't be employed :), Sidnei's 
code looks like this, I think:

if isinstance(text, basestring):
    result = text
else:
    result = unicode(text)

This is the wrong thing if text is in fact not a basestring, but, say, a 
number, which I suspect is something that can happen, even though the 
thing is misleadingly called 'text' -- it's why the 'unicode()' is there 
in the first place. In this case, the string representation of the 
number will be in unicode, which will be wrong if you're running in 
pure-encoded mode. Sidnei, you need to include a unit test where the 
data that enters the page template is not a string; I think it will 
fail. Also include a few tests where the data is actually 0 while you're 
at it, if you are a fan of shortcuts. :)
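
As a sketch of the kind of tests meant here, again in Python 3 spelling 
(bytes for encoded strings, str for unicode): shortcut and explicit are 
hypothetical names for the one-liner and the if/else form above. The 
one-liner falls through to the conversion whenever the string is falsy, 
so an empty encoded string does not come back unchanged. In Python 2 the 
analogous effect is that '' comes back as u''.

```python
def shortcut(text):
    # Python 3 spelling of:
    #   isinstance(text, basestring) and text or unicode(text)
    return isinstance(text, (bytes, str)) and text or str(text)


def explicit(text):
    # The same logic written out as if/else.
    if isinstance(text, (bytes, str)):
        return text
    return str(text)


# Non-empty strings pass through both forms untouched.
assert shortcut(b"abc") == b"abc"
assert explicit(b"abc") == b"abc"

# An empty encoded string is falsy, so the shortcut converts it anyway.
assert explicit(b"") == b""
assert shortcut(b"") != b""

# Non-string data (including 0) is converted to a unicode string by
# both forms, regardless of which mode the template is running in.
assert explicit(0) == "0"
assert shortcut(0) == "0"
assert type(explicit(42)) is str
```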

Anyway, what would work better is the following:

if isinstance(text, basestring):
    result = text
else:
    result = str(text)

As long as str(text) == str(unicode(text)) is True (and doesn't fail 
with a unicode error), this will at least work correctly in both unicode 
mode (as it can deal with plain-ascii) as well as encoded mode (as it 
can deal with plain ascii).

For built-ins other than unicode strings, I think 
str(text) == str(unicode(text)) always holds. The problem remains with 
other objects which are not built-ins and which may want to return 
unicode strings; i.e. custom objects which define __unicode__(). Perhaps 
i18n-ed strings? -- that's another good candidate for a test.

If we *do* need unicode(text) to work safely, we'll need to refactor the 
ZPT code so it can actually run in 'encoded mode' as well as in 'unicode 
mode'. Then any cases where we see 'unicode(text)' (not many, mind) 
need to be replaced with something like:

if encoded_mode:
    result = str(text)
else:
    result = unicode(text)
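
A possible shape for such a helper, as a sketch only. It is in Python 3 
spelling (bytes for encoded, str for unicode), and both the to_output 
name and the way encoded_mode is passed in are assumptions of mine; 
nothing like this exists in the ZPT code right now.

```python
def to_output(value, encoded_mode):
    """Convert a value for output in the mode the template runs in.

    Hypothetical helper: strings pass through untouched, anything else
    is converted to the string type that matches the mode.
    """
    if isinstance(value, (bytes, str)):
        return value
    text = str(value)  # e.g. 42 -> "42"
    if encoded_mode:
        # Assumes the string form is plain ascii, as it is for numbers.
        return text.encode("ascii")
    return text


# A number ends up in the string type of the current mode.
in_encoded_mode = to_output(42, encoded_mode=True)
in_unicode_mode = to_output(42, encoded_mode=False)

# Zero is handled like any other number.
zero_encoded = to_output(0, encoded_mode=True)
```

The point of passing the mode explicitly is that the conversion no 
longer has to guess from the value which kind of string is wanted.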

> The problem is that you can't really predict what the
> encoding will be in Zope 2.  IMO, it is better not to guess.

That's not what Sidnei's code is trying to do. I suggested to him to try 
to make it unicode-agnostic. :)

> If you did guess, you'd probably want to guess latin 1.

That would fail miserably in very common Zope 2 systems, like Silva or 
Plone. :)

> I don't have any good ideas for a short-term hack.  Maybe someone
> else does.

My best hack so far is what I proposed above. It's not that different 
from Sidnei's, though less buggy. :)

Regards,

Martijn