[Zope3-dev] i18n, unicode, and the underline

Mon, 14 Apr 2003 18:08:08 +0200

Guido van Rossum wrote:
> > It does exist with user input, which can be anything, for instance
> > latin-1. The problem occurs when users start mixing (say) latin-1
> > with unicode. This seems to work in tests,
> 
> Only if the tests don't exercise non-ASCII characters.

Yes. Sometimes a system doesn't have very good test coverage. :)
Zope 2 web apps aren't that easy to test, especially in the view
part, with page templates, web forms and such involved.
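
To make the "works in tests" trap concrete, here is a minimal sketch.
The helper name join_like_python2 is made up for illustration; its
explicit ASCII decode stands in for the implicit conversion Python does
when an 8-bit string is combined with a unicode string.

```python
# -*- coding: utf-8 -*-

def join_like_python2(byte_string, text):
    # Hypothetical helper: spells out the implicit ASCII decode that
    # happens when an 8-bit string is mixed with a unicode string.
    return byte_string.decode("ascii") + text

title = u"Caf\xe9"

# ASCII-only input, which is what tests tend to use: no problem.
ok = join_like_python2(b"Menu: ", title)

# The same code path with a latin-1 byte (0xFC, u-umlaut) from a user.
try:
    join_like_python2(b"Men\xfc: ", title)
    failed = False
except UnicodeDecodeError:
    failed = True
```

The second call fails only because of the data, which is why the error
tends to show up in production rather than in the test suite.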

> > until suddenly a user enters some non-ascii character, and then
> > suddenly code starts to give unicode errors in locations that are
> > not always easy to figure out.
> 
> Right.  So we need to be vigilant in all our I/O code.  I'm all for
> making sure that our I/O code always does the right thing with
> non-ASCII characters.  (And it seems that by and large this is already
> taken care of, right?)

Right. I'm just harping on this so nobody forgets. :) I talked a bit to
Kapil about the relational database code, and probably we need some more
hard rules for that as well. He wasn't sure about the details.

> > This is similar to how the use of one floating point number in an
> > integer calculation,
> 
> That's a very good analogy.  It even points out the real reason why
> int/int should return a float.  And it suggests that *any* difference
> in API for Unicode and 8-bit string methods is a problem.  (I think
> the last remaining difference is the signature of translate().)
> 
> > except that it's worse as the application can (but does not
> > necessarily, dependent on input) raise exceptions instead of just
> > not-quite-right outputs.
> 
> Using floats as e.g. sequence indices also raises an exception.

That's true, though that's usually outside the actual number manipulation
code. And this would always raise an exception, not just sometimes,
depending on the input.

[snip]
> Right.  This kind of bug seems to be due to insufficient testing of
> the input code though, not due to inherent problems with Unicode.

I'm not suggesting inherent problems with Unicode or Python's implementation
of it at all. Considering the transition requirements, I had to admit last
summer that the current design is the right thing (reluctantly, as it caused
me a lot of pain and I felt there should be a better way :). I sometimes
wish it didn't do any automatic conversion from string to unicode at all,
but that is rather messy as well in other cases, so the current
transitional situation looks like the best compromise.

The problem is in existing frameworks and developer education. Since my
experience last year I am trying to make sure that with Zope 3 we avoid
most of the pain I went through. :)

> > > Another way to get Unicode problems is that Unicode somehow got
> > > mixed in with ASCII and the ASCII got promoted to Unicode, after
> > > which point the I/O library can't deal with it (e.g. you can't
> > > write a Unicode string to a file or socket).
> > > 
> > > Which was it?
> > 
> > Both. This one also occurred. Page Templates until Zope 2.6 were not
> > capable of dealing with unicode. This means that *all* the code that
> > needs to display unicode needs to be plugged with .encode()s all
> > over the place. Again, if you forget one, it'll work often (as a
> > str() would happen over it at some point), but sometimes it would
> > give a unicode error. This happens at the point of the str() within
> > the page template engine at which point it is hard to figure out
> > where the unicode was coming from.
> 
> Aha.  This description helps, except I'm unclear on where the
> .encode() calls have to be added.  And how would one know the correct
> encoding to use?  Or was UTF8 always right?

latin-1 in our case (or actually some freaky windows encoding which
extends it), though in retrospect UTF-8 would've been better to go with
and we'll do that transition with Silva soon.
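
A small sketch of why the exact encoding matters, assuming the Windows
encoding is cp1252 (it is not named above, so treat that as a guess).
The sample bytes are invented for illustration.

```python
# -*- coding: utf-8 -*-

raw = b"\x80 100"  # hypothetical form input from a Windows browser

# The same byte decodes to different characters in the two encodings.
as_latin1 = raw.decode("latin-1")   # 0x80 becomes the control char U+0080
as_cp1252 = raw.decode("cp1252")    # 0x80 becomes the euro sign U+20AC

# Once the text is unicode, UTF-8 can represent it without loss.
as_utf8 = as_cp1252.encode("utf-8")
```

Decoding with the wrong one of the two raises no error here; it just
yields a different character, which is another reason UTF-8 is the
easier choice for output.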

The .encode() calls needed to be added in all Python scripts which accessed
the (ParsedXML) DOM and passed contents from the DOM tree into page
templates. I.e. in all Python scripts or page templates that retrieve
unicode content we'd need to do a manual encoding. If you forget one you're
in trouble.
We had new unicode errors popping up around us for a long time, and 
we still see the occasional one. With Zope 2.6 we can chuck this code
though as page templates can now deal with unicode and do the right
output encoding.
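
Roughly, each of those manual encodings amounted to something like the
sketch below. The names to_output and OUTPUT_ENCODING are made up for
illustration and are not Zope or ParsedXML API; the point is only that
the encoding happens explicitly, at the boundary where text leaves the
DOM.

```python
# -*- coding: utf-8 -*-

OUTPUT_ENCODING = "latin-1"  # assumption: the encoding used for output

def to_output(value, encoding=OUTPUT_ENCODING):
    # Hypothetical helper: encode unicode text exactly once, on its way
    # out; 8-bit strings are assumed to be encoded already.
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)

encoded_default = to_output(u"caf\xe9")
encoded_utf8 = to_output(u"caf\xe9", "utf-8")
untouched = to_output(b"plain")
```

Every place that hands DOM content to a template needs such a call,
which is exactly why forgetting one is so easy.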

> Sounds like page templates should have been fixed instead to deal with
> Unicode -- I wonder why that was not seen as an option.

We couldn't wait for a new Zope release. I investigated the option but
couldn't get it to work immediately. Since Zope 2.6 was released not that
long ago I think that was the right decision anyway. :)

[snip]
> > The other main problem is that systems are not built for unicode,
> > such as Zope 2.x.
> 
> Maybe it would have been better to stick with that rule, and disallow
> Unicode anywhere in Zope 2.x?  You could still work with encoded text.

We could've done that, but the die has long since been cast. :)

> > [snip three categories]
> > 
> > That's how I think about it. In Python we have ascii, bytes and unicode.
> > The confusion is that ascii and bytes are represented the same way. :)
> 
> Right.  It is an unfortunate effect of the way Python started out with
> a single string type that was used for both text and bytes.  I hope
> that in Python 3.0, this will be fixed by having all strings be
> Unicode and introducing a new datatype for bytes.  (This is Java's
> approach.)

Would it hurt to introduce a new datatype for bytes in Python 2.x? I think
that could increase the expressiveness of code that deals with bytes 
and would ease the transition to 3.0 as well. Would it cause backwards 
compatibility issues?

[snip]
> I wonder if using 8-bit strings encoded as UTF8 would have made things
> easier than using Unicode strings?

Possibly. It wouldn't have been according to the DOM standard though. But
of course in hindsight I would've cared less about that. :)

> > As long as we don't get too many cases where people wonder why they
> > can't put latin-1 in their string literals.
> 
> You should never put Latin-1 in 8-bit string literals *unless* your
> code *including your framework* is monolingual.  This clearly excludes
> Zope 3 *and all Zope 3 products* (even if a product is monolingual,
> the framework isn't).  Latin-1 in Unicode string literals works if the
> source file contains a "# -*- coding: Latin-1 -*-" cookie in line 1 or
> 2.

The problem of course is that some of the people complaining will likely
have no clue about what's ascii and what's latin-1. :)

Regards,

Martijn