[Zope3-dev] i18n, unicode, and the underline

Tue, 15 Apr 2003 12:41:54 +0200

sathya wrote:
> Just want to throw out this question:
> unicode strings may use 16 or 32 bit UTF encoding
> regular strings use 7 bit ascii
> what is this 8 bit ascii that is being talked about ?

There is no 8 bit ascii. Perhaps someone made a typo.

Python strings can be either 8-bit encoded strings or unicode strings (which
use some internal encoding we don't care about).

Python 8-bit encoded strings can have a large number of encodings that
Python understands. Common cases:

  * ascii
  * utf-8
  * latin-1

And a whole lot of encodings for various other character sets.

ascii fits in 7 bits; it does not define the other 128 values that an 8-bit
byte can hold. Unicode has the nice property that it cooperates fine with
ascii, as its first 128 characters are exactly as defined by ascii. Python
exploits this by assuming all 8-bit strings are encoded in ascii by default,
so if you combine an ascii string with a unicode string in some way, you'll
get unicode output without error:

u"foo" + "bar" -> u"foobar" 
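
As a rough sketch, here is the ascii decoding step that Python performs
implicitly in the expression above, written out by hand. The variable names
are made up for illustration, and I am assuming an interpreter that accepts
the b"..." literal for 8-bit strings and the .decode() method as another
spelling of unicode(s, encoding):

```python
# 8-bit string that holds only ascii characters
ascii_bytes = b"bar"

# The step Python takes for you when the two string types are mixed
decoded = ascii_bytes.decode("ascii")
combined = u"foo" + decoded

# Pure ascii bytes decode to the same text under ascii, utf-8 and latin-1
same_everywhere = (
    ascii_bytes.decode("utf-8") == decoded == ascii_bytes.decode("latin-1")
)

print(repr(combined), same_everywhere)
```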

You can also convert explicitly, with the unicode() function and the
.encode() method:

text = unicode(asciitext)

# this will work only if every character in text can be encoded as
# ascii; any other character will give you a unicode error unless
# you take special measures to fudge around that.
asciitext = text.encode()

You can use explicit encoding and decoding too:

text = unicode(asciitext, 'ascii')

asciitext = text.encode('ascii')
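
To make the "special measures" concrete, here is a small sketch of what
happens when text holds a non-ascii character. The sample string is my own,
and "replace" and "ignore" are two of the standard error handlers you can
pass to .encode(); whether fudging like this is acceptable depends on your
application:

```python
text = u"caf\xe9"  # the e-acute has no ascii representation

# Strict encoding (the default) raises a unicode error
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeError:
    strict_failed = True

# Fudging around it: substitute or drop what cannot be encoded
replaced = text.encode("ascii", "replace")  # unencodable characters become "?"
ignored = text.encode("ascii", "ignore")    # unencodable characters are dropped

print(strict_failed, replaced, ignored)
```

Both handlers lose information, so they are only suitable where a lossy
result is better than an exception.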

utf-8 is an encoding of unicode as a sequence of 8-bit bytes (so that it fits
in Python 8-bit strings). It can represent every unicode character, using one
byte for each ascii character and two or more bytes for everything else, so
this is the suggested encoding for Zope input and output if the external
system (for instance a web browser) can deal with it. It also has the property
that 7-bit ascii is a subset of it, so you can read it pretty well if it
contains English, even in an application that does not understand utf-8.

To go to and from Unicode seamlessly:

text = unicode(utf8text, 'utf-8')

# this is guaranteed never to blow up on you with a unicode error,
# as utf-8 can encode any unicode character.
utf8text = text.encode('utf-8')
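
A quick sketch of the round trip, with a sample string of my own choosing,
showing both the multi-byte characters and the ascii subset property. The
.decode() method is used here as another spelling of unicode(s, encoding):

```python
text = u"na\xefve \u20ac5"  # contains i-diaeresis and the euro sign

utf8text = text.encode("utf-8")
roundtrip = utf8text.decode("utf-8")

# ascii characters take one byte each; other characters take several
euro_bytes = u"\u20ac".encode("utf-8")

# For pure ascii text the utf-8 and ascii encodings are byte-for-byte equal
ascii_same = u"plain".encode("utf-8") == u"plain".encode("ascii")

print(roundtrip == text, len(text), len(utf8text), ascii_same)
```

Note that the encoded string is longer than the text (11 bytes for 8
characters here), so lengths and slice offsets differ between the two forms.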

latin-1 is an encoding that uses the full 8 bits as well. Many systems
deal with this by default (for instance web browsers). It contains ascii
in the lower 128 values, and then is extended with all kinds of accented
characters in use in various European languages like German and French.
Python will *not* convert the upper 128 characters to unicode implicitly,
because the default ascii assumption does not cover them. So, if you have
latin-1 text (or something in any other encoding), you *have* to decode it
explicitly before you can use it with unicode:

text = unicode(latin1text, 'latin-1')

# and back again. This can blow up if text contains characters that cannot
# be represented by latin-1
latin1text = text.encode('latin-1')

A common mistake is to treat something like latin-1 as if it were ascii,
and then mix it with unicode. This will cause code to blow up with a
unicode error as soon as your latin-1 string contains a character that is
not also an ascii character.
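
Here is a minimal sketch of that mistake. It spells out the ascii decode
that Python's default assumption amounts to, then shows the explicit latin-1
decode that works, plus the reverse failure mentioned earlier. The sample
data is made up:

```python
latin1text = b"caf\xe9"  # "cafe" with e-acute, encoded as latin-1

# The mistake: treating latin-1 bytes as if they were ascii
try:
    latin1text.decode("ascii")
    mistake_failed = False
except UnicodeError:
    mistake_failed = True

# The fix: decode explicitly with the encoding the data is really in
text = latin1text.decode("latin-1")

# Going back can fail too: the euro sign is not part of latin-1
try:
    u"\u20ac".encode("latin-1")
    euro_failed = False
except UnicodeError:
    euro_failed = True

print(mistake_failed, repr(text), euro_failed)
```

The nasty part is that the mistake stays hidden for as long as the test data
happens to be pure ascii.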

To make an application internationalized, use unicode throughout. This is the
way of the future. Of course, you'll be dealing with outside systems
that do use encodings (the network, relational databases, the filesystem,
any python modules that aren't unicode aware). Whenever you do that, you
need to make sure that at this boundary you:

  * know the encoding the input is in (or make a solid assumption about it
    if you can't be absolutely sure, like for http requests), and use
    unicode(inputtext, encoding) to decode it to unicode.

  * make sure that any unicode you want to output is first encoded in the
    output encoding the external application expects, with 
    outputtext.encode(encoding).
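
The two rules above could be sketched like this. The shout() helper and its
parameter names are hypothetical, invented only to show the shape of the
boundary handling, and .decode() is used as another spelling of
unicode(inputtext, encoding):

```python
def shout(inputtext, input_encoding, output_encoding):
    """Hypothetical helper: decode on the way in, encode on the way out."""
    text = inputtext.decode(input_encoding)  # 8-bit input -> unicode
    text = text.upper()                      # all real work happens on unicode
    return text.encode(output_encoding)      # unicode -> 8-bit output


# latin-1 comes in from one external system, utf-8 goes out to another
result = shout(b"caf\xe9", "latin-1", "utf-8")
print(repr(result))
```

Everything between the decode and the encode only ever sees unicode, which
is the point: the encodings are dealt with at the edges and nowhere else.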

If the external application can deal with utf-8 text, use that, as that
way you'll never have .encode() blowing up at you with a unicode error.

The Zope 3 framework will try to make all this thinking mostly unnecessary,
as its own input/output subsystems will do the right thing for you
automatically, so you can quit worrying and just use unicode strings
throughout.

Regards,

Martijn