[Zope-dev] RE: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

Mon Apr 26 06:17:14 EDT 2004

> --On Monday, 26 April 2004 10:53 +0200 David Convent 
> <david.convent at naturalsciences.be> wrote:
> 
> > I always believed that unicode and utf-8 were same encoding, but 
> > reading you let me think i was wrong.
> > Can you tell me what the difference is between unicode and utf-8 ?

Andreas Jung wrote: 
> Unicode is common database for almost all characters. UTF-8 
> is an *encoding* that allows you to represent any element of 
> this character database as set for 1,2,3 or 4 bytes. There 
> are also other encoding e.g. like UTF16 that encode an 
> element in a different way....so we are talking about 
> completely different things.

Yes, the difference is that Python has a whole different understanding of
Unicode strings (type(u"")) than it has of text in some character encoding
(e.g., UTF-8, GB18030, ISO-8859-1, ASCII, stored as type("")).  Python will
of course represent these Unicode strings internally in some way (as 16-bit
or 32-bit code units, depending on how the interpreter was built), but we
don't need to know the details.  All we need to know is that this is a
string that can contain any character on the planet, and that we can
reasonably expect normal text operations to work on it.
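
To make the distinction concrete, here is a small sketch.  It is written
for Python 3, where str is the Unicode string type (what u"" gives you in
Python 2) and bytes holds encoded text (what "" gives you in Python 2); the
sample name is just an illustration.

```python
# Python 3 spelling: str is Unicode text, bytes is text in some encoding.
text = "Bj\u00f6rn"                    # 5 characters; the 3rd is o-umlaut

# The same text, serialized with two different character encodings.
utf8 = text.encode("utf-8")            # o-umlaut takes 2 bytes here
latin1 = text.encode("iso-8859-1")     # o-umlaut takes 1 byte here

print(type(text).__name__, len(text))      # str 5
print(type(utf8).__name__, len(utf8))      # bytes 6
print(type(latin1).__name__, len(latin1))  # bytes 5

# Decoding with the matching encoding gives back the same Unicode string.
assert utf8.decode("utf-8") == text
assert latin1.decode("iso-8859-1") == text
```

The byte strings differ, but both decode to the same Unicode string, which
is the only thing the application code should have to care about.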

UTF-8 is, like ISO-8859-1 (latin1), just a character encoding.  It (and
UTF-16 and UCS-4) is only special in that it is defined by the Unicode
standard and can encode any Unicode character.  (UCS-2, being a fixed 16
bits, only covers the Basic Multilingual Plane.)  ISO-8859-1, by contrast,
being only 8 bits, can encode just 256 characters, enough for most Western
European languages.  GB18030, to take another extreme, is a variable-width
encoding (1, 2 or 4 bytes per character) mandated by the Chinese
government; it defines a mapping for every Unicode code point, including
the non-Chinese ones, so in function it is similar to UTF-8 or UCS-4.
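
Which encodings can represent which characters is easy to check from
Python.  The following is a minimal sketch for Python 3; the helper name
encoded_length and the four sample characters are my own choices for
illustration, while the codec names are the ones shipped with Python.

```python
# Sample characters: ASCII letter, Latin-1 letter, CJK ideograph, and a
# character outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

def encoded_length(ch, encoding):
    """Return how many bytes ch needs in encoding, or None if the
    encoding has no representation for it."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

# "-le" variants are used so that no byte order mark is counted.
for enc in ("iso-8859-1", "utf-8", "utf-16-le", "utf-32-le", "gb18030"):
    print(enc, [encoded_length(ch, enc) for ch in samples])
```

ISO-8859-1 reports None for the ideograph and for the character outside
the BMP, while UTF-8, UTF-16, UTF-32 and GB18030 report a length for every
sample.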

Internally, we want to work with Unicode strings (where str[4] is the 5th
character) instead of UTF-8 encoded byte strings (where str[4] is merely the
5th byte, which may be only a fragment of a multi-byte character and so has
little semantic meaning).
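
A short sketch of why indexing encoded bytes is not meaningful, again in
Python 3 terms (str for Unicode, bytes for UTF-8 encoded text); the sample
string is arbitrary.

```python
text = "na\u00efve caf\u00e9"       # 10 characters
utf8 = text.encode("utf-8")         # 12 bytes: two characters need 2 bytes

# Indexing the Unicode string yields a whole character.
third_char = text[2]                # i with diaeresis

# Indexing the encoded bytes yields one byte, here only the first half of
# the two-byte sequence for that character.
third_byte = utf8[2]                # 0xC3

# That lone byte is not valid UTF-8 on its own.
try:
    utf8[2:3].decode("utf-8")
    fragment_is_valid = True
except UnicodeDecodeError:
    fragment_is_valid = False

print(len(text), len(utf8), repr(third_char), hex(third_byte),
      fragment_is_valid)
```

Slicing has the same problem: text[:3] is always valid text, whereas
utf8[:3] cuts a character in half.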

Bye,
-- 
Bjorn