[Zope-dev] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

Mon Apr 26 04:27:43 EDT 2004

While we're all waiting for Zope 3 and Plone 3, I'd like to know what the
"standard practice" way of using Unicode with Zope 2 is.  In particular, we'd
like to store all text as Unicode in the ZODB, and have Zope do the
encoding/decoding as automatically and transparently as possible.

We've been using Zope 2's ZPublisher to do this encoding/decoding for over 2
years, and it's working fine.  We just have to ensure that we set the
appropriate encoding in an HTTP Content-type header, and that we add
:utext/ustring:ENCODING to HTML form field names.  Regardless of what you
may have heard, THIS WORKS FINE!  We also store Unicode, not UTF-8 (or other
encodings), strings in the ZODB.
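
To make the field-name convention concrete, here is a minimal sketch of the
idea in plain Python (written for Python 3 so it runs standalone).  It is not
ZPublisher's actual code: decode_field is a hypothetical name, and the sketch
only handles the two scalar converters mentioned above.

```python
# Converters that ask for a Unicode result (scalar ones only, for brevity).
UNICODE_CONVERTERS = {"ustring", "utext"}


def decode_field(field_name, raw_value):
    """Split 'name:ustring:ENCODING' and decode the submitted bytes.

    Hypothetical helper that only mimics the idea behind the
    type-converter suffixes; it is not the real implementation.
    """
    name, *suffixes = field_name.split(":")
    wants_unicode = any(s in UNICODE_CONVERTERS for s in suffixes)
    encodings = [s for s in suffixes if s not in UNICODE_CONVERTERS]
    if wants_unicode and encodings:
        return name, raw_value.decode(encodings[0])
    # No Unicode converter requested: hand the bytes back untouched.
    return name, raw_value


# Bytes as a browser would submit them from a UTF-8 encoded page.
name, value = decode_field("title:ustring:utf-8", "Bj\u00f8rn".encode("utf-8"))
```

The application code after this point only ever sees the decoded Unicode
value, which is the whole attraction of letting the publisher do the work.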

The problems we're running into are with other components, which make
our Unicode-with-Zope experience, shall we say, less than ecstatic.  (Put
it this way: I seem to lose hair much faster when dealing with Unicode
problems :)  I'm wondering why components/products don't all rely on
the ZPublisher for Unicode encoding/decoding.  Is there another standard
way?

Here is a summary of what we've found:

ZMI
* gets charset from manage_page_charset encoding
* relies on ZPublisher for encoding (but doesn't do decoding, see below)
* in PropertyManager you can add ustrings, but since it doesn't add
:ENCODING to the field names, you get a Unicode error when trying to save,
because it tries to decode the text assuming ASCII (big problem)
* DTML Methods/Documents: don't support Unicode (annoying)
* can't use Unicode ids (not a big problem)
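
The PropertyManager failure above is the generic "decode as ASCII" problem.
A small standalone illustration (Python 3 syntax; the sample value is made
up, and this is not the ZMI code path itself):

```python
# Bytes as a browser would submit them from a UTF-8 encoded form.
raw = "Bj\u00f8rn".encode("utf-8")

try:
    # Without an encoding on the field name, ASCII gets assumed.
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# With the encoding known, the same bytes decode cleanly.
decoded = raw.decode("utf-8")
```

Any non-ASCII character in the submitted value is enough to trigger the
error, which is why the missing :ENCODING suffix matters so much.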

Archetypes:
* gets charset from portal_url.getCharset() or
portal_properties.site_properties.default_charset
* doesn't rely on ZPublisher, does its own encoding/decoding
* returns encoded strings, not Unicode strings, to Zope apps, leading to
problems such as:
  - SearchableText() encodes, and as such can't be used with Unicode-aware
ZCatalogs
  - transform() encodes
    (and because of that SearchableText() sometimes decodes/encodes 2 times
instead of 0 times)
  - get()ing field values will encode them, so if you want Unicode, you have
to decode yourself
    (adding both unnecessary overhead for data access, and unnecessary
dependency on the global variable for the charset)
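
To show the round trip this forces on callers, here is a toy comparison in
Python 3.  Both classes and SITE_CHARSET are hypothetical stand-ins invented
for this sketch; they are not Archetypes classes or its API.

```python
SITE_CHARSET = "utf-8"  # stand-in for the site-wide charset setting


class EncodingField:
    """Toy field: stores Unicode but hands out encoded bytes."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value.encode(SITE_CHARSET)


class UnicodeField:
    """Toy field: stores Unicode and hands it out unchanged."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


text = "Bj\u00f8rn"

# The caller has to undo the encoding and must know the global charset.
from_encoding_field = EncodingField(text).get().decode(SITE_CHARSET)

# No conversion and no dependency on a global setting.
from_unicode_field = UnicodeField(text).get()
```

Both paths end up with the same text, but the first one pays for an encode
and a decode on every access and breaks if the two sides ever disagree
about the charset.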

Plone:
* no special Unicode support for HTML forms; relies on Archetypes

Formulator:
* gets charset from manage_page_charset (same as ZMI), but can be overridden
* stores field values as encoded text (not Unicode), but lets you specify
which encoding to use
  (confusingly calls this "unicode" mode)
* messages are stored as UTF-8 (hardcoded)

I suggest this way of dealing with Unicode right now in Zope 2:

(1) Let ZPublisher do the encoding/decoding of form input and HTML output:

  a. Always set a character encoding in an HTTP Content-type header

  b. Always append :ustring/utext/ulines/utokens:ENCODING to field names of
fields that support Unicode
      (we may need some library code to make this easier)
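
As a rough idea of what such library code could look like, here is a
minimal sketch (Python 3).  The function names, the default encoding and
the suffix order (taken from the text above) are all assumptions made for
illustration; nothing like this ships with Zope as far as I know.

```python
UNICODE_CONVERTERS = ("ustring", "utext", "ulines", "utokens")


def unicode_field_name(name, converter="ustring", encoding="utf-8"):
    """Build a form field name such as 'title:ustring:utf-8'."""
    if converter not in UNICODE_CONVERTERS:
        raise ValueError("not a Unicode converter: %r" % (converter,))
    return "%s:%s:%s" % (name, converter, encoding)


def content_type_header(encoding="utf-8"):
    """Build the Content-type value that announces the page encoding."""
    return "text/html; charset=%s" % encoding


title_field = unicode_field_name("title")
body_field = unicode_field_name("body", "utext", "iso-8859-1")
header_value = content_type_header()
```

Generating both the header and the field names from one encoding value
keeps them from drifting apart, which is the usual way this goes wrong.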

(2) Store Unicode strings directly in the ZODB.  The ZODB is perfectly
capable of storing strings in Python's internal Unicode format; no need to
encode the text to UTF-8 or some other encoding.
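
The ZODB serializes objects with Python's pickle, so a plain pickle round
trip gives a rough idea of why no manual encoding is needed.  This sketch
(Python 3, made-up sample data) stands in for a real ZODB transaction,
which is not shown here.

```python
import pickle

# Unicode text stored as-is, with no explicit encode step.
record = {"title": "Bj\u00f8rn Stabell", "body": "\u4e2d\u6587"}

# Serialize and restore, roughly what happens on commit and load.
restored = pickle.loads(pickle.dumps(record))
```

The restored values are Unicode strings again, identical to what went in.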

(3) Encode/decode yourself when reading from/writing to other external data
sources such as files and other databases.  Do it just before you write, or
just after you read, so that as much code as possible can be
encoding-agnostic.  Keep the encoding/decoding as close to the "source data"
as possible.   The best way to do it is (in most cases) to specify the
encoding on the IO stream, and let Python do the encoding/decoding for you
transparently.  If possible, get the encoding from the external data source
(e.g., the file) instead of relying on a magical global variable.  If you
have to rely on a global variable, let it be manage_page_charset.
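
A minimal sketch of putting the encoding on the IO stream, written for
Python 3 where the built-in open() takes an encoding argument (on the
Python 2 releases Zope 2 runs on, codecs.open plays the same role).  The
file name and sample text are made up.

```python
import os
import tempfile

text = "Bj\u00f8rn \u4e2d\u6587"
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Encoding happens at the boundary, just before the bytes hit the disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Decoding happens at the boundary, just after the bytes are read.
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Only the file itself ever contains encoded bytes.
with open(path, "rb") as f:
    raw = f.read()
```

Everything between the two open() calls works on Unicode only and never
needs to know which encoding the file uses.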

(4) [This is really just advice...] Resist patching your code to work with
components that don't deal with Unicode.  Others are likely having the
same problem, so to avoid ending up with lots of ugly patches (that are the
source of mysterious Unicode problems), fix the problem at its source: the
other component.  It's really not that difficult to fix (if we agree on how
it should be fixed ;)

None of the above components handles Unicode in this way, but it seems to be
how the Unicode support in Zope 2 was meant to be used.  Let me know if
there is another better way, but please do let me know...  I think we need
to resolve this once and for all or I know some people that'll just go mad
(or bald, or both) :)

I'll be willing to contribute patches, but since this applies to so many
products, it would be good to get some consensus first.  At the very least,
can we create a "Standard Unicode Practices" page?

Bye,
-- 
Bjorn Stabell <mailto:bjorn at exoweb.net>