[Zope3-dev] string agnostic page templates, again

Tue Sep 7 13:23:10 EDT 2004

Fred Drake wrote:
> On Mon, 06 Sep 2004 14:46:08 +0200, Martijn Faassen <faassen at infrae.com> wrote:
> 
>>It seems the previous discussion about changing the ZPT engine so it
>>also works with non-unicode text died down. This is still important for
>>Five scenarios where the page templates need to interact with legacy
>>Zope 2 content. Most Zope 2 content is not in unicode (though there are
>>some exceptions such as Silva). In this case it is important that the
>>page template engine can accept non-unicode input (*all* input being
>>classic strings) without breaking.
> 
> Sorry for not getting back to this earlier; Jim and I talked about
> this some last week, and left it on my plate to respond.  Well, here
> goes...
> 
> One of my goals here is to not have two implementations of TAL in the
> end.  I think we can accommodate this without a lot of pain.

Right; Five already exposes Zope 3's version to Zope 2, hopefully a step 
in the right direction. :)

> The most important thing to note is that we really only care whether
> something has been converted to a str or unicode value when writing it
> to the output stream.  Since the TAL interpreter already uses a
> subclass of StringIO, this can be handled by changing the write()
> method of that object and removing the calls to str() / unicode() that
> are the current cause for concern.  The modified write() method should
> keep track of whether a unicode value is ever seen, and whether a
> non-(str or unicode) value is ever seen.  If everything is a str or
> unicode, it doesn't need to do anything special.  Otherwise, it needs
> to apply str() or unicode() to anything that isn't one or the other to
> perform the conversion (which will be determined by whether unicode
> was ever seen), and then continue normally.
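
A minimal sketch of the write() idea described above, not the actual
TALInterpreter code. It is written for Python 3, so bytes stands in for
the classic str and str stands in for unicode. The class name
AgnosticOutput, its helper methods, and the ASCII decoding of classic
strings (mimicking Python 2's implicit coercion) are assumptions made
for illustration.

```python
class AgnosticOutput:
    """Hypothetical output buffer that defers all conversion to getvalue()."""

    def __init__(self):
        self.chunks = []
        self.seen_unicode = False  # a unicode value (Python 3 str) was written
        self.seen_other = False    # a value that is neither bytes nor str was written

    def write(self, value):
        # Only record what kind of value was seen; no str()/unicode() call here.
        if isinstance(value, str):
            self.seen_unicode = True
        elif not isinstance(value, bytes):
            self.seen_other = True
        self.chunks.append(value)

    def _to_bytes(self, value):
        # Stand-in for Python 2's str(); assumes the text form is pure ASCII.
        return value if isinstance(value, bytes) else str(value).encode("ascii")

    def _to_text(self, value):
        # Stand-in for Python 2's unicode(); classic strings decode as ASCII.
        return value.decode("ascii") if isinstance(value, bytes) else str(value)

    def getvalue(self):
        # The target type depends only on whether unicode was ever seen.
        if self.seen_unicode:
            return "".join(self._to_text(c) for c in self.chunks)
        return b"".join(self._to_bytes(c) for c in self.chunks)
```

With only classic strings and a stray object the result stays a classic
string; once a unicode value has been written, everything is converted
to unicode. Note that in this sketch non-ascii classic strings mixed
with unicode only fail inside getvalue(), not at the write() call.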

What if the first thing on the stream is latin-1 and then unicode gets 
added?

What if the first thing on the stream is a non-string object?

> I think that keeps the whole thing properly unicode-agnostic, and
> avoids having to deal with this issue all over the code; it becomes
> isolated, and can be handled by overriding the StringIO factory on the
> TALInterpreter if necessary.
> 
> Do you agree that this addresses your needs, Martijn?

It sounds like a good approach; I'm just slightly worried about edge
cases like the ones above.

It should break horribly the moment encoded (non-ascii) text is mixed
with unicode, presumably earlier than in .getvalue(), which is where it
breaks now.

Going through some sequences of writes, as semi-regex patterns (the
first five should work; the last two should fail):

ascii*

unicode*

ascii+
(non-ascii|ascii)*

ascii+
unicode+
(ascii|unicode)*

unicode+
ascii*

non-ascii+
unicode
<immediate exception>

unicode+
non-ascii
<immediate exception>
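
The sequences above could be checked at write() time roughly as
follows. This is only a sketch under assumptions, again in Python 3
terms (bytes for classic strings, str for unicode). The names
StrictOutput and MixedEncodingError are hypothetical, and rejecting
non-string objects outright is my assumption here, since that case is
still an open question.

```python
class MixedEncodingError(ValueError):
    """Hypothetical error: non-ascii encoded text was mixed with unicode."""


class StrictOutput:
    """Hypothetical output buffer that fails in write(), not in getvalue()."""

    def __init__(self):
        self.chunks = []
        self.seen_unicode = False  # a unicode value (Python 3 str) was written
        self.seen_encoded = False  # a non-ascii classic string was written

    def write(self, value):
        if isinstance(value, str):
            if self.seen_encoded:
                raise MixedEncodingError("unicode after non-ascii text")
            self.seen_unicode = True
        elif isinstance(value, bytes):
            # Pure ASCII is compatible with both kinds of stream.
            if not value.isascii():
                if self.seen_unicode:
                    raise MixedEncodingError("non-ascii text after unicode")
                self.seen_encoded = True
        else:
            # Assumption: non-string objects are rejected in this sketch.
            raise TypeError("expected bytes or str, got %s" % type(value).__name__)
        self.chunks.append(value)

    def getvalue(self):
        if self.seen_unicode:
            # Every bytes chunk is known to be ASCII at this point.
            return "".join(
                c.decode("ascii") if isinstance(c, bytes) else c
                for c in self.chunks
            )
        return b"".join(self.chunks)
```

Here the ascii-then-unicode and ascii-then-non-ascii sequences both go
through, while either order of non-ascii and unicode raises in the
write() call that introduces the conflict.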

Regards,

Martijn