[Zope3-dev] i18n, unicode, and the underline
Guido van Rossum
guido@python.org
Mon, 14 Apr 2003 13:06:28 -0400
> From: Shane Hathaway <shane@zope.com>
>
> >>That still leaves the literal string "hole", and IMHO it's a really big
> >>hole. Three methods have been suggested for patching this hole:
> >>prefixing all literal strings with "u", calling the unicode() builtin in
> >>code that concatenates strings, or using 'python -U'. Since 'python -U'
> >>doesn't quite work, we only have the first two options for now, and both
> >>are a burden for the programmer. One requires uglifying the source, and
> >>the other requires deeper knowledge than we wanted to require of
> >>programmers. Ouch.
> Guido van Rossum wrote:
> > I've forgotten the context... Why would you want string manipulation
> > functions to return Unicode even when the result can be expressed as
> > ASCII?
> From: Shane Hathaway <shane@zope.com>
>
> Hmm, Python does try very hard to hide the difference between ASCII
> strings and Unicode, so you have a good point. What's missing is the
> ability to clearly distinguish between ASCII strings and binary strings.
> When a function that expects only ASCII or Unicode gets a binary
> string, it might blow up, but not every time, and the source of the
> error is often hard to find. This has caused pain for Zope 3 developers.
What were the sources of binary strings in the cases where it caused
pain? Were they string literals or read from a file or socket?
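
A minimal sketch of the "blows up, but not every time" failure mode described in the quoted text. It is written for a current Python 3 interpreter, where the implicit ASCII coercion that Python 2 applies when mixing str and unicode has to be modeled with an explicit decode. The helper name `combine` is hypothetical and not taken from Zope or from this thread.

```python
def combine(text, data):
    """Model Python 2's implicit coercion: decode the byte string
    as ASCII, then concatenate it onto the Unicode text.
    (Hypothetical helper, for illustration only.)"""
    return text + data.decode("ascii")


# A byte string that happens to contain only ASCII combines silently.
ok = combine("caf\u00e9 ", b"menu")

# The same call fails as soon as a byte has its high bit set, which
# may happen far away from the place where the data was produced.
try:
    combine("caf\u00e9 ", b"\xe9t\xe9")
    failed = False
except UnicodeDecodeError:
    failed = True
```

Whether the call succeeds depends only on the data that arrives at run time, which is why the source of such an error can be hard to trace.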
> What if strings had a "binary" flag? Any attempt to combine a
> binary string with a Unicode string should fail, even if the binary string
> has all of the high bits unset. Literal strings should be ASCII
> strings unless they have any characters with the high bit set or
> they have '\0' characters.
>
> Errors in combining strings with Unicode would probably be caught
> earlier this way. This would be a little different from the new
> byte array type, since binary strings would be immutable and share
> implementation with ASCII strings.
Alas, that's a language change, and an odd one at that. I'm not sure
that it solves the problem right. I first need to understand the
problem better though to think more about a solution.
So far the problem still sounds like "combining non-Unicode strings
and Unicode strings causes problems", and it's hard to propose a
single solution to that... :-(
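
For reference, the "binary flag" idea quoted above can be sketched roughly as follows. This only illustrates the proposed behaviour and is not an existing Python feature; the names `is_binary_literal` and `FlaggedString` are hypothetical, and the sketch assumes a current Python 3 interpreter, with `bytes` standing in for the 8-bit string type.

```python
def is_binary_literal(data):
    """Rule proposed for literals: treat the string as binary if any
    byte has the high bit set or is a NUL ('\\0') byte."""
    return any(b >= 0x80 or b == 0 for b in data)


class FlaggedString:
    """An 8-bit string carrying an explicit 'binary' flag (hypothetical)."""

    def __init__(self, data, binary=None):
        self.data = data
        # Infer the flag from the content unless the caller sets it.
        self.binary = is_binary_literal(data) if binary is None else binary

    def combine(self, text):
        # Flagged strings never mix with Unicode, even when every
        # byte would decode cleanly as ASCII.
        if self.binary:
            raise TypeError("cannot combine a binary string with Unicode")
        return self.data.decode("ascii") + text


plain = FlaggedString(b"abc")
blob = FlaggedString(b"abc", binary=True)   # ASCII-only, but marked binary

combined = plain.combine(" \u00e9")
try:
    blob.combine(" \u00e9")
    rejected = False
except TypeError:
    rejected = True
```

Under this rule the error is raised at the point of combination regardless of the byte values, instead of only when a high-bit byte happens to show up.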
--Guido van Rossum (home page: http://www.python.org/~guido/)