[Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

Mon Apr 26 18:19:47 EDT 2004

Bjorn Stabell wrote:

> Formulator:
> * gets charset from manage_page_charset (same as ZMI), but can be overridden
> * stores field values as encoded text (not Unicode), but lets you specify
> which encoding to use
>   (confusingly calls this "unicode" mode)
> * messages are stored as UTF-8 (hardcoded)

While there is no question that Formulator's user interface is 
confusing where unicode is concerned, most of this is not correct 
(unless there are bugs I don't know about).

Formulator has two modes: unicode mode and 'classic mode'. In unicode 
mode, all strings are stored as Python unicode strings. In classic mode, 
all strings are stored in 'whatever encoding the user is using'. It's 
possible to convert a form from one mode to the other, and you can 
specify which encoding to use for that conversion. In unicode mode, that 
encoding is ignored, however.
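
To make the mode switch concrete, here is a minimal sketch. It is not 
Formulator's actual code: the function names and the dict of stored 
values are hypothetical, and it is written for present-day Python 3, 
where `str` plays the role of the Python unicode string and `bytes` the 
role of the encoded classic-mode string.

```python
def to_unicode_mode(values, encoding):
    """Decode encoded byte strings to text; leave other values alone."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in values.items()
    }


def to_classic_mode(values, encoding):
    """Encode text to byte strings in the given encoding; leave other values alone."""
    return {
        key: value.encode(encoding) if isinstance(value, str) else value
        for key, value in values.items()
    }


# Hypothetical stored form properties in classic mode (latin-1 bytes).
classic_values = {"title": b"Caf\xe9", "size": 20}
unicode_values = to_unicode_mode(classic_values, "latin-1")
```

The encoding only matters at the moment of conversion; once the values 
are text, nothing needs to remember which encoding they came from.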

Classic mode basically exists so as not to break all Formulator forms 
already in existence. This complicated the design significantly, but I 
thought this was important.

Quite independently from this, fields can also be configured to 
*deliver* unicode upon validation/conversion. The character set of the 
page that the form is in can be specified in the form settings.
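
As a rough illustration of what "delivering unicode upon validation" 
means, consider the sketch below. It is an assumption-laden toy, not 
Formulator's API: `validate_string_field` and its parameters are made up 
for this example, and it uses Python 3, where raw form input is `bytes` 
and unicode text is `str`.

```python
def validate_string_field(raw, page_encoding, deliver_unicode):
    """Validate raw form input and optionally deliver it as unicode text.

    raw             -- the submitted value as bytes (hypothetical input)
    page_encoding   -- character set of the page the form is in
    deliver_unicode -- whether this field is configured to deliver unicode
    """
    value = raw.strip()
    if deliver_unicode:
        # Decode using the page's character set, as set on the form.
        return value.decode(page_encoding)
    # Otherwise hand back the encoded string unchanged.
    return value


name = validate_string_field(b" Bj\xc3\xb6rn ", "utf-8", True)
```

Whether a field delivers unicode is a separate switch from how the form 
stores its own strings, which is why the two settings are independent.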

> I suggest this way of dealing with Unicode right now in Zope 2:

General note: this way sounds good to me, but I know from hard 
experience how difficult it is to convert an existing application 
fully to unicode.

> (1) Let ZPublisher do the encoding/decoding of form input and HTML output:
> 
>   a. Always set a character encoding in a HTTP Content-type request

Silva does this (and Formulator too).

>   b. Always append :ustring/utext/ulines/utokens:ENCODING to field names of
> fields that support Unicode
>       (we may need some library code to make this easier)

Formulator won't be able to do 'b' very easily. It'll do its own 
converting to unicode though for fields that want this.

> (2) Store Unicode strings directly in the ZODB.  The ZODB is perfectly
> capable of storing strings in Python's internal Unicode format; no need to
> encode the text to UTF-8 or some other encoding.

Silva has been doing this fully since version 0.9.2, released in the 
summer of last year. Formulator took a while longer to catch up (before 
that, it would only interoperate if the form titles etc. were pure 
ASCII), but is now a first class citizen in a Zope/unicode environment. 
Its XML serialization is UTF-8 in this mode.

> (3) Encode/decode yourself when reading from/ writing to other external data
> sources such as files and other databases.  Do it just before you write, or
> just after you read, so that as much code as possible can be
> encoding-agnostic.  Keep the encoding/decoding as close to the "source data"
> as possible.   The best way to do it is (in most cases) to specify the
> encoding on the IO stream, and let Python do the encoding/decoding for you
> transparently.  If possible, get the encoding from the external data source
> (e.g., the file) instead of relying on a magical global variable.  If you
> have to rely on a global variable, let it be manage_page_charset.
> 
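
A minimal sketch of point (3), assuming a plain file as the external 
data source. The helper names and the file name are hypothetical, and 
the code targets Python 3; `io.open` with an `encoding` argument lets 
Python do the encoding/decoding on the stream, so the rest of the code 
only ever sees text.

```python
import io
import os
import tempfile


def write_text(path, text, encoding):
    """Encode at the last moment: text goes in, encoded bytes land on disk."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding):
    """Decode at the first moment: encoded bytes on disk come back as text."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical usage: the encoding is passed explicitly at the boundary
# rather than looked up from a global variable.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "title.txt")
write_text(path, "Bj\u00f6rn", "latin-1")
title = read_text(path, "latin-1")
```
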
> (4) [This is really just advice...] Resist patching your code to work with
> components that don't deal with Unicode.  Others are likely having the
> same problem, so to avoid ending up with lots of ugly patches (that are the
> source of mysterious Unicode problems), fix the problem at its source: the
> other component.  It's really not that difficult to fix (if we agree on how
> it should be fixed ;)

It's actually quite difficult to fix if you care about backwards 
compatibility. Fixing Formulator was quite complicated. You're 
definitely making this sound far easier than it is. It's a good thing to 
do, and Silva has done it, but the words 'not that difficult' don't fit 
in this debate.

> None of the above components handles Unicode in this way, but it seems to be
> how the Unicode support in Zope 2 was meant to be used. 

You're actually wrong about Formulator. :)

Regards,

Martijn