[Zope3-dev] i18n, unicode, and the underline

Guido van Rossum guido@python.org
Mon, 14 Apr 2003 08:09:38 -0400


> Guido van Rossum wrote:
> > So I don't see how an application could get away with using "..." 
> > containing non-ASCII text at all -- it would fail as soon as the
> > literal was being output.

[Martijn]
> Note that in Zope 2 (at least until 2.6 which is a bit better) this
> can lead to obscure unicode errors occurring in unexpected places,
> dependent on user input and so on.

I've heard this before, and it seems to be the main reason why Zope 3
has chosen an (IMO unnecessary) anal attitude towards Unicode.  But
nobody has ever been able to show me what the problems were.  (I'm not
doubting there were problems; I just need more details to understand
what was the matter.)

I'd like to see at least one report of an actual case, because there
are a few different ways that Unicode can cause errors.

One way to get Unicode problems is non-ASCII in 8-bit text that should
have been converted to Unicode (which means that the encoding must be
known).

Another way to get Unicode problems is that Unicode somehow got mixed
in with ASCII and the ASCII got promoted to Unicode, after which point
the I/O library can't deal with it (e.g. you can't write a non-ASCII
Unicode string to a file or socket).

Which was it?
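
The two failure modes above can be sketched with explicit decode and
encode calls.  This is only an illustration, not a report of an actual
Zope case: the sample bytes are made up, and the implicit ASCII
conversion that Python performs when the two string types meet is
spelled out here as explicit .decode("ascii") / .encode("ascii") calls
so that the sketch also runs on current Python.

```python
# Failure mode 1: 8-bit text holding non-ASCII bytes, encoding unknown.
raw = b"caf\xe9"  # made-up sample: Latin-1 bytes, but the bytes don't say so

try:
    raw.decode("ascii")  # what a promotion to Unicode would attempt
    mode1 = "ok"
except UnicodeDecodeError:
    mode1 = "UnicodeDecodeError"

# Once the encoding is known, the conversion is safe.
text = raw.decode("latin-1")

# Failure mode 2: non-ASCII Unicode reaching an ASCII-only output path.
try:
    text.encode("ascii")  # what an ASCII-only writer would attempt
    mode2 = "ok"
except UnicodeEncodeError:
    mode2 = "UnicodeEncodeError"

print(mode1, mode2)
```

Which exception shows up says which side went wrong: a decode error
points at 8-bit data that was never converted, an encode error points
at Unicode arriving where only ASCII can be written.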

> I now know how python's unicode system works, but most Python
> developers really don't, in my experience.

That (i.e. understanding) seems to be the core problem.  Maybe it
helps to think of THREE categories instead of 8-bit strings
vs. Unicode strings.  These are:

1) ASCII.  This is always safe, and there's no gain or loss in using
   Unicode literals.  E.g. sys.stdout.write(u"abc") works just fine,
   and "abc"==u"abc", and you can even mix Unicode and 8-bit as dict
   keys: d={"8":8, u"u":16}; print d[u"8"], d["u"].

2) 8-bit strings containing some *encoded* form of not-just-ASCII text
   (even Latin-1 is an encoding).  These are nasty.  They can be moved
   around as 8-bit strings, but since Python doesn't know which
   encoding is used, they can't be mixed with Unicode in any way.  You
   may even get a problem when a dict contains a key of this kind when
   doing an unrelated Unicode key lookup!

3) True Unicode strings (containing not-just-ASCII text).  Ignoring
   16- vs. 32-bit issues, these are not encoded.  They have the
   reverse problem from 8-bit strings: the I/O library doesn't know
   how to deal with these!
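
As a rough sketch, the three categories can be told apart by whether a
value survives an ASCII round trip.  The helper name category() is
hypothetical, introduced only for this illustration, and the sketch is
written to run on current Python, where the 8-bit type is spelled
bytes.

```python
def category(s):
    """Return 1, 2 or 3 for the three categories (hypothetical helper)."""
    if isinstance(s, bytes):
        # 8-bit string: safe only if every byte is ASCII.
        try:
            s.decode("ascii")
            return 1  # plain ASCII
        except UnicodeDecodeError:
            return 2  # encoded non-ASCII, encoding unknown
    # Unicode string: safe for ASCII-only I/O only if it is pure ASCII.
    try:
        s.encode("ascii")
        return 1  # plain ASCII
    except UnicodeEncodeError:
        return 3  # true Unicode, must be encoded before I/O


samples = [b"abc", u"abc", b"caf\xe9", u"caf\xe9"]
results = [category(s) for s in samples]
print(results)
```

Both b"abc" and u"abc" land in category 1, which is the point of the
first category: for pure ASCII the choice of string type makes no
difference to whether conversion succeeds.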

> The bugs are nasty to track down and don't encourage developer
> enlightenment either - the response is "oh, one of these unicode
> weird errors again", without understanding what the right approach
> should be.

That's why I'd like to see some samples of these, rather than blanket
reports.

> Now in Zope 3 this unicode confusion may not take place as Zope 3 is
> unicode in the core, and input and output are frequently
> automatically translated. If confusion is indeed less, using ascii
> literals may be easy to explain. But if it would lead to anything
> like the Zope 2.x situation, where non-ascii user input can cause
> the system to complain at unexpected code paths, then let's please
> remain hard-line (even to the point of silliness) about unicode.
> 
> It's far easier to explain "use unicode everywhere" than to explain
> the other alternatives.

So I've noticed, and that leads to the (in my eyes) abomination of
u"..." literals containing ASCII strings.  We need to stamp out this
misunderstanding!!!

--Guido van Rossum (home page: http://www.python.org/~guido/)