[Zope3-dev] i18n, unicode, and the underline

Martijn Faassen faassen@vet.uu.nl
Tue, 15 Apr 2003 21:29:08 +0200


Guido van Rossum wrote:
> > > I'm glad you aren't.  I got feedback on Unicode from Jim that
> > > strongly suggested *he* thought we'd done it all wrong, and I want
> > > to nip this in the bud if I can.
> 
> [Martijn]
> > He was frustrated by the automatic promotion of 8-bit strings
> [containing ASCII only]
> > to unicode, which I understand.
> 
> This is a circular reasoning.  Jim's frustration doesn't come from the
> automatic promotion but from tracking down certain Unicode bugs in
> Zope 2, and he (misguidedly IMO) believed that without promotion these
> bugs would have been easier to track down.  (Not that they wouldn't
> have occurred at all!)

Perhaps I'm missing a step in the chain that makes this circular reasoning.
I'll try to explain. Later on I should mine this thread and put the
explanations in a document for public consumption.

As I recall he was tracking down unicode bugs in Zope 3 networking
code, not Zope 2 code, but I may be mistaken. 

By the way, I didn't say that only 8-bit strings containing ascii are
automatically promoted to unicode -- 8-bit strings containing *anything*
are promoted, or at least the attempt is made. If the string *happens* to
contain only 7-bit characters, the promotion succeeds silently, even
though the encoding of the string might actually be latin-1 (Dutch text,
for instance, often happens to be ascii-only :).
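
To make that concrete, here is a minimal sketch. It assumes that the
promotion amounts to an ascii decode; implicit_promote is a hypothetical
stand-in for what the interpreter does behind the scenes, and the sketch
uses an explicit bytes type so that the step is visible.

```python
def implicit_promote(data):
    """Hypothetical stand-in for the automatic promotion: decode as ASCII."""
    return data.decode("ascii")

# Latin-1 encoded Dutch text that happens to contain only 7-bit characters.
plain = "fiets".encode("latin-1")
# Latin-1 encoded Dutch text with one non-ASCII character (e with diaeresis).
accented = "re\u00ebel".encode("latin-1")

# Succeeds, so the fact that the data is really latin-1 goes unnoticed.
promoted = implicit_promote(plain)

try:
    implicit_promote(accented)
    failed = False
except UnicodeDecodeError:
    # Only this input reveals the encoding mismatch.
    failed = True
```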

In addition, *bytes* being automatically promoted to unicode as soon as
unicode sneaks into the code can also be confusing. Take for instance
certain HTTP code that requires ascii: you don't catch the promotion
right away (no unicode error), but only in a later code path.

In the case of Dutch, a code path where unicode leaks in and a str()
happens in some location may work for a while, until suddenly a user
fills in a character that is latin-1 but not ascii, and you get a unicode
error at the location of the str(), quite far removed from what the user
did. The user may successfully input the offending data, but some other
output path may suddenly fail with unicode errors.
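
A rough sketch of that scenario, under the assumption that the stray
str() behaves like an ascii encode. The store and render functions and
the list used as a database are hypothetical, made up for illustration.

```python
def store(form_value, database):
    # Input path: the value is kept as text, nothing is checked here.
    database.append(form_value)

def render(database):
    # Output path: hypothetical stand-in for a stray str() call, which
    # is assumed to amount to encoding the text as ASCII.
    return [value.encode("ascii") for value in database]

database = []
store("fiets", database)
first = render(database)        # works: only ASCII so far

store("re\u00ebel", database)   # the input is accepted without complaint
try:
    render(database)
    error = None
except UnicodeEncodeError as exc:
    error = exc                 # fails here, far from the store() call
```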

In a case without promotion at all, our code using bytes would be 
raising an error in each and every case where we are passing unicode into
code that uses bytes in some way already (typically because we use
byte literals, like ''.join()). We can silence the error by doing
a manual encoding.
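
As an illustration only: the sketch below assumes an interpreter in
which text and bytes are distinct types that never mix, which is the
no-promotion case described above. The request fragments are made up.

```python
# Text has leaked into byte-oriented code.
parts = [b"GET ", "/index.html"]

try:
    b"".join(parts)
    mixed_error = None
except TypeError as exc:
    # Raised at the first mix of bytes and text, whatever the contents.
    mixed_error = exc

# The manual encoding that silences the error.
fixed = b"".join(
    p if isinstance(p, bytes) else p.encode("ascii") for p in parts
)
```

The error does not depend on what the user typed, so it shows up the
first time the code runs rather than on the first non-ascii input.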

Currently we'd likely get an error somewhat further down in the code,
instead of right after we made the mistake, when unicode is first
combined with bytes. Alternatively, if our code happens to do a str()
sometimes (like Page Templates used to do), it will sometimes 'work' (if
the input happens to contain only ascii characters but is actually
latin-1), and sometimes fail.

Without promotion, in code using only unicode, we'd get an error each and 
every time when we feed in 8 bit strings. We can then silence the error by 
doing a manual decoding.

Currently we'd sometimes not get an error, if the 8-bit string happens to
contain only ascii characters but is actually latin-1.

Some code that uses absolutely no string or unicode literals and does not do 
any encoding and decoding would be generic for both cases, the same as it is
now.
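
A small sketch of such generic code; the wrap function is a hypothetical
example, not taken from any existing module. Every piece is supplied by
the caller, so the same function serves both kinds of string.

```python
def wrap(value, left, right):
    # No literals and no encode/decode calls: the caller supplies it all.
    return left + value + right

as_text = wrap("naam", "<", ">")
as_bytes = wrap(b"naam", b"<", b">")
```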

Python refuses to add strings to numbers for very similar reasons; if it did 
allow you to add '1' + 1 and give you 2, we'd still get an exception as soon 
as the user happened to input 'foo' instead of '1', but this might occur 
further along the code path where it's harder to debug.
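
To spell out the analogy, here is a sketch of what such a lenient
addition could look like. lenient_add is hypothetical; Python's own +
does not behave this way.

```python
def lenient_add(a, b):
    """Hypothetical '+' that silently converts strings to numbers."""
    return int(a) + int(b)

ok = lenient_add("1", 1)   # the lucky case

try:
    lenient_add("foo", 1)
    late_error = None
except ValueError as exc:
    late_error = exc       # surfaces only for this particular input
```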

So I agree with Jim that without automatic promotion certain classes of
bugs would've been easier to track down. Jim and I are both misguided
that way. :) That's not to say I'm suggesting it should be this way.

> I'm quite sure that without such automatic promotion we'd all be very
> frustrated about the difficulty of converting text-processing code to
> Unicode.

Yes, I agree that the current scheme has a lot of benefits, so I decided
that upon reflection the current scheme was the best compromise given 
the constraints. It's just that it has drawbacks too, and pointing those
out is not circular reasoning. :)

> > I do wish there was some better way to turn this behavior off on a
> > per-module basis than going to site.py. But considering the
> > tradeoffs the current behavior seems to be the best compromise.
> 
> Why do *you* want to turn this behavior off?

See above. Of course I haven't thought this through in detail, but
I can imagine that if I'm writing a module that only deals with
text and not with bytes or I/O, I'd want to get an error as soon
as an alien 8-bit string sneaks in. In this module I'd also use
u'' religiously; if I don't, I'd get punished soon enough.

Or alternatively I'm only dealing with bytes or encoded 8-bit strings in
a module. In this module I wouldn't use 'u' anywhere; I'd get punished
immediately if I do anyway.

I guess the crucial case when I'd like the automatic conversion turned off
would be when I'm writing a *new* module. The other criterion would be
that the developer knows what he's doing.

> > [snip]
> > > It's not clear to me how never auto-converting would make your life
> > > easier.
> > 
> > You'd get a unicode error in the place where you made a mistake,
> > instead of in some later section of the code.  If you care that
> > 8-bit strings are bytes (which Jim does when he writes networking
> > code) then you want this behavior, for instance.
> 
> It should not be hard to typecheck the arguments to networking
> routines to make sure they are 8-bit strings; then the rest of the
> networking code won't have to worry about Unicode sneaking in.

Or just try a .encode('ascii') as that also works on non-unicode strings.
It's easy to forget one place and let the wrong thing sneak in, though, so
this can be a frustrating exercise. But yes, that's the best way.
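
A minimal sketch of such a check at the boundary of a networking
routine. to_wire is a hypothetical name, and the sketch assumes the
routine wants ascii-only bytes; it is written with separate text and
bytes types so that both branches are explicit.

```python
def to_wire(value):
    """Hypothetical boundary check for a networking routine."""
    if isinstance(value, bytes):
        value.decode("ascii")       # raises if the bytes are not ASCII
        return value
    return value.encode("ascii")    # raises if the text is not ASCII

header = to_wire("Host")            # text is encoded
same = to_wire(b"Host")             # ASCII bytes pass through unchanged
```

Anything that is not plain ascii raises right here, at the boundary,
instead of somewhere deeper in the networking code.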

> I also wonder if it would have eased the pain of tracking down those
> problems if the contents of the offending strings would have been
> shown in the error message.  That would probably have revealed their
> source.

Yes, that's an interesting idea. I think in many cases this would help,
though sometimes those strings would be very long.
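
One way this could look, as a sketch only: decode_ascii_verbose and its
limit parameter are hypothetical, and the truncation is my assumption
about how to deal with very long strings.

```python
def decode_ascii_verbose(data, limit=40):
    """Hypothetical helper: decode as ASCII, showing the data on failure."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        # Show at most `limit` bytes so long strings stay readable.
        shown = repr(data[:limit])
        if len(data) > limit:
            shown += "..."
        raise ValueError("cannot decode %s: %s" % (shown, exc.reason))

try:
    decode_ascii_verbose(b"caf\xe9")
    message = None
except ValueError as exc:
    message = str(exc)   # the message now contains the offending string
```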

> > That said, autoconverting can also make ones life easier, especially
> > in the transition environment we're in.
> 
> For example, most text processing code also uses string literals,
> either to search for or to insert (e.g. "<" or "\n").  It's a pure
> blessing that as long as these are ASCII, text processing code works
> for 8-bit text as well as for Unicode.

Agreed, that is a major benefit.

> [snip]
> 
> > I wasn't fully aware of the details of this last year, now I am, so
> > it's not rocket science. :)
> 
> You may be underestimating yourself. :-)

Thank you. :)

Regards,

Martijn