[Zope3-dev] i18n, unicode, and the underline

Mon, 14 Apr 2003 13:06:28 -0400

> From: Shane Hathaway <shane@zope.com>
> 
> >>That still leaves the literal string "hole", and IMHO it's a really big 
> >>hole.  Three methods have been suggested for patching this hole: 
> >>prefixing all literal strings with "u", calling the unicode() builtin in 
> >>code that concatenates strings, or using 'python -U'.  Since 'python -U' 
> >>doesn't quite work, we only have the first two options for now, and both 
> >>are a burden for the programmer.  One requires uglifying the source, and 
> >>the other requires deeper knowledge than we wanted to require of 
> >>programmers.  Ouch.

> Guido van Rossum wrote:
> > I've forgotten the context...  Why would you want string manipulation
> > functions to return Unicode even when the result can be expressed as
> > ASCII?

> From: Shane Hathaway <shane@zope.com>
> 
> Hmm, Python does try very hard to hide the difference between ASCII 
> strings and Unicode, so you have a good point.  What's missing is the 
> ability to clearly distinguish between ASCII strings and binary strings.
> When a function that expects only ASCII or Unicode gets a binary 
> string, it might blow up, but not every time, and the source of the 
> error is often hard to find.  This has caused pain for Zope 3 developers.
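
The "not every time" failure described above can be sketched as follows. This is only an illustration: it is written for present-day Python 3, where bytes and text never mix implicitly, so the helper py2_style_concat is a hypothetical stand-in that imitates the implicit ASCII decoding an 8-bit string undergoes when combined with Unicode in Python 2.

```python
def py2_style_concat(byte_str, text):
    """Hypothetical helper: imitate Python 2's implicit coercion by
    decoding the 8-bit string as ASCII before joining it to the text."""
    return byte_str.decode("ascii") + text


label = "caf\u00e9"  # a Unicode string

# ASCII-only bytes combine silently, so the latent bug stays hidden.
ok = py2_style_concat(b"menu: ", label)

# The same call blows up as soon as the bytes contain a high-bit byte.
try:
    py2_style_concat(b"\xe9t\xe9: ", label)
    failed = False
except UnicodeDecodeError:
    failed = True
```

Whether the error appears depends on the data rather than on the code path, which is why its source can be hard to trace.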

What were the sources of binary strings in the cases where it caused
pain?  Were they string literals or read from a file or socket?

> What if strings had a "binary" flag?  Any attempt to combine a
> binary string with a Unicode string should fail, even if the binary string
> has all of the high bits unset.  Literal strings should be ASCII
> strings unless they have any characters with the high bit set or
> they have '\0' characters.
> 
> Errors in combining strings with Unicode would probably be caught
> earlier this way.  This would be a little different from the new
> byte array type, since binary strings would be immutable and share
> implementation with ASCII strings.
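
A rough model of the proposed flag, as I read it, might look like the sketch below. FlaggedStr is a made-up class for illustration only, not an existing or planned Python type, and it is written for present-day Python 3 with the 8-bit data held in a bytes object.

```python
class FlaggedStr:
    """Hypothetical 8-bit string carrying a 'binary' flag."""

    def __init__(self, data, binary=None):
        self.data = bytes(data)
        if binary is None:
            # Literal rule from the proposal: binary if any byte has the
            # high bit set or is NUL; otherwise treat it as ASCII text.
            binary = any(b == 0 or b >= 0x80 for b in self.data)
        self.binary = binary

    def __add__(self, text):
        if not isinstance(text, str):
            return NotImplemented
        if self.binary:
            # Fail on the flag alone, even when every high bit is unset.
            raise TypeError("cannot combine a binary string with Unicode")
        return self.data.decode("ascii") + text


combined = FlaggedStr(b"abc") + "\u00e9"      # ASCII string: allowed
nul_is_binary = FlaggedStr(b"a\x00b").binary  # NUL byte marks it binary

try:
    FlaggedStr(b"abc", binary=True) + "x"     # binary flag: always rejected
    rejected = False
except TypeError:
    rejected = True
```

The point of the flag is that the failure depends on how the string was created, not on which bytes it happens to contain.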

Alas, that's a language change, and an odd one at that.  I'm not sure
that it solves the problem correctly.  I need to understand the
problem better before I can think more about a solution.

So far the problem still sounds like "combining non-Unicode strings
and Unicode strings causes problems", and it's hard to propose a
single solution to that... :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)