[Zope3-dev] i18n, unicode, and the underline

Martijn Faassen faassen@vet.uu.nl
Tue, 15 Apr 2003 12:11:37 +0200


Shane Hathaway wrote:
> Martijn Faassen wrote:
> >Shane Hathaway wrote:
> >
> >>May I suggest that while Python's Unicode support is transitional, all 
> >>methods and functions that expect to manipulate Unicode should convert 
> >>strings to Unicode at runtime?  Not all functions would have to do this, 
> >>only those that concatenate strings (I think).
> >
> >
> >Hm, I think that this is a bad idea:
> >
> >  * how can I convert strings to unicode if I don't know what encoding the
> >    string is in?
> >
> >  * why pay for this performance and code complexity impact?
> >
> >If your framework properly uses unicode, this is endless overhead on the
> >programmer for no good reason..
> 
> That's what I thought when I made the transition from C++ to Java.  I 
> was pretty skeptical, but here's what I figured out:
> 
> - The source file should be written in 7-bit ASCII, so the default 
> encoding doesn't matter.  (I *think* that's the story.)

I don't understand what you mean there. The encoding of user input
still matters (even though in this case the encoding of literals will
always be ascii).
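
To make that concrete (the bytes here are made up, and I'm going from
memory on the exact exception, so take it as a sketch):

    data = 'caf\xe9'          # 8-bit bytes from, say, a form post
    data.decode('latin-1')    # u'caf\xe9' -- an accented "cafe"
    data.decode('utf-8')      # raises a UnicodeError: not valid utf-8

Without out-of-band knowledge of the encoding there's no way to pick
the right .decode() call.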

I understood you (perhaps wrongly) to be saying that functions should
check their input and convert it to unicode if it's an 8-bit string.
I objected to that.

Later on, after reading Guido's mail, I concluded you might mean
something else: that all such functions should *output* unicode, and
not typecheck their input. That's better, but I still don't see why I
need to pay attention to it. If Zope 3 is unicode in the core and
guards its I/O, we are either talking about bytes (and we can use 8-bit
strings), or we're talking about unicode (and 7-bit ascii happens to
mix fine with that, so ascii literals will just work).
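
For example (a made-up snippet, but this is how the coercion works as
far as I can tell):

    title = u'Species overview'   # unicode coming from the core
    title + ' (draft)'            # fine: 7-bit ascii coerces to unicode
    title + ' caf\xe9'            # UnicodeError: non-ascii 8-bit string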

The value of making sure all functions return unicode even if they get
7-bit ascii input would therefore be that if you pass in non-ascii
8-bit strings, you get a unicode error pretty early on. Is that why
you were suggesting this strategy?
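
Something like this, I mean (render_title is just a hypothetical
helper I'm making up for the example):

    def render_title(context, title):
        # always hands back unicode
        return u'%s - %s' % (context, title)

    render_title(u'Zope', 'fauna')     # fine, ascii input coerces
    render_title(u'Zope', 'fa\xfcna')  # fails right here, not much
                                       # later in unrelated code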

> - The code doesn't increase in complexity as long as the core functions 
> accept either strings or Unicode.  Except when you're doing I/O or C 
> extensions, you can forget that you're using Unicode at all.  If you 
> have to add Unicode later for I18N, you'll pay a much higher price in 
> complexity.

I don't understand; code can't deal with "either strings or unicode"
interchangeably. Code can deal with ascii & unicode, but not with
arbitrary 8-bit *strings* and unicode.

> - The only real difference between 8-bit character strings and 32-bit 
> character strings is 24 bits per character. :-)  Modern processors deal 
> with either kind of string with virtually equal speed.  The only cost is 
> in conversion between the formats, and if your program is typical, the 
> conversion only needs to happen when it's communicating with the outside 
> world.

Now I really don't understand what you're suggesting anymore. 

I'm suggesting we use unicode in the core and convert when dealing with
the outside world (I/O, meaning the network, files, and relational
databases). Ascii literals in code are also fine, as they mix cleanly
with unicode. Even ascii strings coming in as input are okay if you are
absolutely sure they *are* ascii; commonly you aren't, so I'd recommend
against relying on that, as it leads to errors rather easily. Arbitrary
8-bit strings are *not* fine, as there's no way to convert them to
unicode automatically unless you know their encoding, and usually we
can't assume anything about that.
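
In sketch form (the filenames and the utf-8 encoding are just
assumptions for the example, not a proposal):

    infile = open('species.txt', 'rb')
    text = infile.read().decode('utf-8')   # 8-bit in -> unicode at the edge
    infile.close()

    text = text.upper()                    # internal work is all unicode

    outfile = open('report.txt', 'wb')
    outfile.write(text.encode('utf-8'))    # unicode -> 8-bit on the way out
    outfile.close()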

But perhaps you're suggesting the same thing. What is it exactly that
you are suggesting? :)

Regards,

Martijn