[Zope3-dev] i18n, unicode, and the underline

Guido van Rossum guido@python.org
Fri, 11 Apr 2003 11:14:41 -0400


I hate to add even more confusion, but as a matter of principle, I
cringe each time I see u"..." used for a string literal that contains
only ASCII characters.  It is the same kind of cringe I feel when I
see someone write ``return(foo)'' instead of ``return foo''.

The ultimate goal of adding Unicode strings to Python is to make *all*
strings be Unicode.  Jython already does this -- it ignores the 'u'
prefix to string literals, as all its strings are Java strings, which
are Unicode.  The distinction between 8-bit strings and Unicode in
CPython is a transitional measure, necessary because of the huge
amount of code that depends on 8-bit strings (especially C
extensions).
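
For ASCII-only text the two spellings are already interchangeable in
practice; only the type differs.  A quick CPython sketch (the literal
is just an example):

    print type("abc")      # <type 'str'>
    print type(u"abc")     # <type 'unicode'>
    print "abc" == u"abc"  # true: the implicit ASCII decode succeeds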

When all strings are Unicode, of course we'll need a new data type to
represent "byte array", and I/O operations on binary files will use
these.  Conversion between byte arrays and strings will always be
explicit, and there probably won't be a literal type for byte arrays.
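
With today's CPython the spelling of such an explicit conversion
already exists; a minimal sketch (the codec names are just examples):

    u = unicode("caf\xe9", "latin-1")  # 8-bit string -> Unicode,
                                       # codec named explicitly
    b = u.encode("utf-8")              # Unicode -> 8-bit string,
                                       # codec named explicitly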

There's more to this, and I understand that since Zope 3 must work
with the status quo, it needs to take a stance about Unicode, but
nevertheless I want to fight the temptation to write ASCII strings as
u"..." literals.  The _() function would barf if its argument were an
8-bit string containing non-ASCII characters, because it would decode
its string argument using the ASCII encoding.  When 8-bit "..." string
literals containing only ASCII are used anywhere else, the same
implicit ASCII decoding happens when they are passed to code that
expects Unicode.  So I don't see how an application could get away
with using "..." literals containing non-ASCII text at all -- it would
fail as soon as the literal was output.  OTOH I don't see any
advantage to using u"..." for literals containing only ASCII.
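
To make that concrete, a minimal sketch (the literals are just
examples):

    print u"menu: " + "tea"       # fine: "tea" is silently decoded
                                  # as ASCII
    print u"menu: " + "caf\xe9"   # raises UnicodeError -- the implicit
                                  # ASCII decode chokes on byte 0xe9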

--Guido van Rossum (home page: http://www.python.org/~guido/)