[Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

David Convent david.convent at naturalsciences.be
Mon Apr 26 04:53:17 EDT 2004


Hi Bjorn,

I always believed that Unicode and UTF-8 were the same encoding, but reading
your message makes me think I was wrong.
Can you tell me what the difference is between Unicode and UTF-8?

Bjorn Stabell wrote:

>While we're all waiting for Zope 3 and Plone 3, I'd like to know what the
>"standard practice" way of using Unicode with Zope 2 is.  In particular, we'd
>like to store all text as Unicode in the ZODB, and have Zope do the
>encoding/decoding as automatically and transparently as possible.
>
>We've been using Zope 2's ZPublisher to do this encoding/decoding for over 2
>years, and it's working fine.  We just have to ensure that we set the
>appropriate encoding in a HTTP Content-type header, and that we add
>:utext/ustring:ENCODING to HTML form field names.  Regardless of what you
>may have heard, THIS WORKS FINE!  We also store Unicode, not UTF-8 (or other
>encodings), strings in the ZODB.
>
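
A small illustration of the "Unicode, not UTF-8" distinction made in the paragraph above. This is only a sketch, and it assumes Python 3, where `str` is the Unicode text type and `bytes` holds encoded data (in the Python 2 used with Zope 2 the corresponding types are `unicode` and `str`). The sample word is made up for the example.

```python
# One Unicode string: a sequence of code points, independent of any encoding.
text = "Bj\u00f8rn"          # 5 code points; U+00F8 is the letter o-with-stroke

# Two different byte encodings of that same text.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# The bytes differ: U+00F8 takes two bytes in UTF-8 and one byte in Latin-1.
print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding each byte string with its own encoding gives back the same text.
same_text = utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1") == text
print(same_text)
```

So UTF-8 is one of several ways to turn Unicode text into bytes, and storing "Unicode" means storing the text itself rather than any one of those byte forms.
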
>The problems we're running into are with other components, basically making
>our Unicode-with-Zope experience, shall we say, less than ecstatic (To put
>it this way, I seem to lose hair much faster when dealing with Unicode
>problems :)   I'm wondering why components/products aren't all relying on
>the ZPublisher for Unicode encoding/decoding?  Is there another standard
>way?
>
>Here is a summary of what we've found:
>
>ZMI
>* gets charset from manage_page_charset encoding
>* relies on ZPublisher for encoding (but doesn't do decoding, see below)
>* in PropertyManager you can add ustrings, but since it doesn't add
>:ENCODING to the field names, you get a Unicode error when trying to save
>since it tries to decode the text assuming ASCII (big problem)
>* DTML Methods/Documents: doesn't support Unicode (annoying)
>* can't use Unicode id's (not a big problem)
>
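
The PropertyManager failure described above (text decoded as ASCII because no encoding was declared) can be reproduced in miniature. This is a sketch in Python 3 and not ZPublisher's actual code; `decode_form_value` is a hypothetical helper invented for the illustration.

```python
# Bytes as a browser might submit them for a UTF-8 page.
raw = "caf\u00e9".encode("utf-8")


def decode_form_value(raw_bytes, encoding=None):
    """Hypothetical helper: use the declared encoding, else fall back to ASCII."""
    return raw_bytes.decode(encoding or "ascii")


# Without a declared encoding the non-ASCII bytes cannot be decoded.
try:
    decode_form_value(raw)
    failed = False
except UnicodeDecodeError:
    failed = True

# With the encoding declared, the same bytes decode cleanly.
decoded = decode_form_value(raw, "utf-8")
print(failed, decoded == "caf\u00e9")
```
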
>Archetypes:
>* gets charset from portal_url.getCharset() or
>portal_properties.site_properties.default_charset
>* doesn't rely on ZPublisher, does its own encoding/decoding
>* returns encoded strings, not Unicode strings, to Zope apps, leading to
>problems such as:
>  - SearchableText() encodes, and as such can't be used with Unicode-aware
>ZCatalogs
>  - transform() encodes
>    (and because of that SearchableText() sometimes decodes/encodes 2 times
>instead of 0 times)
>  - get()ing field values will encode them, so if you want Unicode, you have
>to decode yourself
>    (adding both unnecessary overhead for data access, and unnecessary
>dependency on the global variable for the charset)
>
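
To show why accessors that return encoded strings are awkward for callers, here is a minimal sketch in Python 3. `SITE_CHARSET`, `get_encoded` and `get_unicode` are hypothetical stand-ins for illustration only; they are not the Archetypes API.

```python
SITE_CHARSET = "utf-8"      # stand-in for a site-wide charset setting
stored = "na\u00efve"       # the value as Unicode text


def get_encoded():
    """Hypothetical accessor that encodes on the way out."""
    return stored.encode(SITE_CHARSET)


def get_unicode():
    """Hypothetical accessor that hands back the Unicode text untouched."""
    return stored


# A caller that wants Unicode must decode again, and must know the charset.
roundtrip = get_encoded().decode(SITE_CHARSET)

# Guessing the charset wrongly does not raise; it silently corrupts the text.
wrong = get_encoded().decode("latin-1")

print(roundtrip == get_unicode(), wrong == stored)
```

The Unicode-returning accessor needs neither the extra decode step nor the global charset.
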
>Plone:
>* no special Unicode support for HTML forms; relies on Archetypes
>
>Formulator:
>* gets charset from manage_page_charset (same as ZMI), but can be overridden
>* stores field values as encoded text (not Unicode), but lets you specify
>which encoding to use
>  (confusingly calls this "unicode" mode)
>* messages are stored as UTF-8 (hardcoded)
>
>
>I suggest this way of dealing with Unicode right now in Zope 2:
>
>(1) Let ZPublisher do the encoding/decoding of form input and HTML output:
>
>  a. Always set a character encoding in the HTTP Content-type header
>
>  b. Always append :ustring/utext/ulines/utokens:ENCODING to field names of
>fields that support Unicode
>      (we may need some library code to make this easier)
>
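
The "library code" mentioned in (1)b might look something like the sketch below (Python 3). `unicode_field_name` is a hypothetical helper, not part of Zope. The suffix order it produces simply follows the notation used in this message (converter, then ENCODING); I have not checked it against ZPublisher, so verify the order your Zope version expects before relying on it.

```python
# Converter names taken from the list in point (1)b above.
UNICODE_CONVERTERS = ("ustring", "utext", "ulines", "utokens")


def unicode_field_name(name, converter="ustring", encoding="utf-8"):
    """Hypothetical helper: append a Unicode converter and encoding to a field name."""
    if converter not in UNICODE_CONVERTERS:
        raise ValueError("unknown Unicode converter: %r" % (converter,))
    return "%s:%s:%s" % (name, converter, encoding)


field = unicode_field_name("title")
html = '<input type="text" name="%s" />' % field
print(html)
```
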
>(2) Store Unicode strings directly in the ZODB.  The ZODB is perfectly
>capable of storing strings in Python's internal Unicode format; no need to
>encode the text to UTF-8 or some other encoding.
>
>(3) Encode/decode yourself when reading from/writing to other external data
>sources such as files and other databases.  Do it just before you write, or
>just after you read, so that as much code as possible can be
>encoding-agnostic.  Keep the encoding/decoding as close to the "source data"
>as possible.   The best way to do it is (in most cases) to specify the
>encoding on the IO stream, and let Python do the encoding/decoding for you
>transparently.  If possible, get the encoding from the external data source
>(e.g., the file) instead of relying on a magical global variable.  If you
>have to rely on a global variable, let it be manage_page_charset.
>
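
Point (3), specifying the encoding on the IO stream, can be sketched as follows. This assumes Python 3 (or 2.6+), where `io.open` accepts an `encoding` argument; on older Pythons `codecs.open` plays a similar role. The file name and sample text are made up for the example.

```python
import io
import os
import tempfile

text = "Gr\u00fc\u00dfe"    # Unicode text held by the application
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# Encoding happens at the boundary: the stream encodes on write...
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# ...and decodes on read, so the rest of the code only ever sees Unicode.
with io.open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

# The file itself holds UTF-8 bytes.
with io.open(path, "rb") as f:
    on_disk = f.read()

print(loaded == text, len(on_disk))
```
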
>(4) [This is really just advice...] Resist patching your code to work with
>components that don't deal with Unicode.  Others are likely having the
>same problem, so to avoid ending up with lots of ugly patches (that are the
>source of mysterious Unicode problems), fix the problem at its source: the
>other component.  It's really not that difficult to fix (if we agree on how
>it should be fixed ;)
>
>
>None of the above components handles Unicode in this way, but it seems to be
>how the Unicode support in Zope 2 was meant to be used.  Let me know if
>there is another better way, but please do let me know...  I think we need
>to resolve this once and for all or I know some people that'll just go mad
>(or bald, or both) :)
>
>I'll be willing to contribute patches, but since this applies to so many
>products, it would be good to get some consensus first.  At the very least,
>can we create a "Standard Unicode Practices" page?
>
>
>Bye,
>  
>


-- 
David Convent



