[Zope3-dev] i18n, unicode, and the underline

Mon, 14 Apr 2003 10:43:56 -0400

> > > Guido van Rossum wrote:
> > > > So I don't see how an application could get away with using "..." 
> > > > containing non-ASCII text at all -- it would fail as soon as the
> > > > literal was being output.
> > 
> > [Martijn]
> > > Note that in Zope 2 (at least until 2.6 which is a bit better) this
> > > can lead to obscure unicode errors occurring in unexpected places,
> > > dependent on user input and so on.
> > 
> Guido van Rossum wrote:
> > I've heard this before, and it seems to be the main reason why Zope 3
> > has chosen an (IMO unnecessary) anal attitude towards Unicode.  But
> > nobody has ever been able to show me what the problems were.  (I'm not
> > doubting there were problems; I just need more details to understand
> > what was the matter.)
> 
> From: Martijn Faassen <faassen@vet.uu.nl>
> 
> I doubt the problem is really big with literals, as it's fairly easy
> to make sure they're ascii only.

Agreed, but whenever I complained to Jim about u"" literals containing
only ASCII text, he read me the law about always using Unicode. :-(

> It does exist with user input, which can be anything, for instance
> latin-1. The problem occurs when users start mixing (say) latin-1
> with unicode. This seems to work in tests,

Only if the tests don't exercise non-ASCII characters.

> until suddenly a user enters some non-ascii character, and then
> suddenly code starts to give unicode errors in locations that are
> not always easy to figure out.

Right.  So we need to be vigilant in all our I/O code.  I'm all for
making sure that our I/O code always does the right thing with
non-ASCII characters.  (And it seems that by and large this is already
taken care of, right?)
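
The failure mode described above can be sketched in a few lines.  This
is an illustration only, written in Python 3 syntax: bytes stands in
for the 8-bit string type, and the hypothetical helper
join_like_python2() spells out, as an assumption, the implicit ASCII
decoding that takes place when an 8-bit string is mixed with a Unicode
string.

```python
def join_like_python2(text, raw):
    """Concatenate Unicode text with an 8-bit string.

    Hypothetical helper: it makes explicit the implicit ASCII
    decoding applied when the two string types are mixed.
    """
    return text + raw.decode("ascii")


# ASCII-only input: the mix works, so a test like this passes.
print(join_like_python2("name: ", b"Guido"))

# Latin-1 input containing e-acute (byte 0xE9): the same code now fails.
try:
    join_like_python2("name: ", "Andr\xe9".encode("latin-1"))
except UnicodeDecodeError as exc:
    print("failed:", exc.reason)
```

A test suite that only ever feeds the first kind of input will not
notice the problem.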

> This is similar to how the use of one floating point number in an
> integer calculation,

That's a very good analogy.  It even points out the real reason why
int/int should return a float.  And it suggests that *any* difference
in API for Unicode and 8-bit string methods is a problem.  (I think
the last remaining difference is the signature of translate().)

> except that it's worse as the application can (but does not
> necessarily, dependent on input) raise exceptions instead of just
> not-quite-right outputs.

Using floats as e.g. sequence indices also raises an exception.
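
A small sketch of both halves of the analogy, assuming true division
(int/int returning a float, which is what Python 3 does):

```python
# True division of two ints yields a float, even when both operands are ints.
quotient = 7 / 2
print(type(quotient).__name__, quotient)

items = ["a", "b", "c"]
try:
    items[1.0]  # a float is not accepted as a sequence index
except TypeError as exc:
    print("TypeError:", exc)
```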

> > I'd like to see at least one report of an actual case, because
> > there are a few different ways that Unicode can cause errors.
> 
> > One way to get Unicode problems is non-ASCII in 8-bit text that
> > should have been converted to Unicode (which means that the
> > encoding must be known).
> 
> This is what I just described. This hurts if you have a complicated
> application where suddenly all input-holes need to be plugged. If
> you forget to convert the input in one point, you will get an error
> later, but this may not be so easy to detect, as for many inputs the
> error does not occur if you for instance write Dutch in latin-1.

Right.  This kind of bug seems to be due to insufficient testing of
the input code though, not due to inherent problems with Unicode.

> > Another way to get Unicode problems is that Unicode somehow got
> > mixed in with ASCII and the ASCII got promoted to Unicode, after
> > which point the I/O library can't deal with it (e.g. you can't
> > write a Unicode string to a file or socket).
> > 
> > Which was it?
> 
> Both. This one also occurred. Page Templates until Zope 2.6 were not
> capable of dealing with unicode. This means that *all* the code that
> needs to display unicode needs to be plugged with .encode()s all
> over the place. Again, if you forget one, it'll work often (as a
> str() would happen over it at some point), but sometimes it would
> give a unicode error. This happens at the point of the str() within
> the page template engine at which point it is hard to figure out
> where the unicode was coming from.

Aha.  This description helps, except I'm unclear on where the
.encode() calls have to be added.  And how would one know the correct
encoding to use?  Or was UTF8 always right?

Sounds like page templates should have been fixed instead to deal with
Unicode -- I wonder why that was not seen as an option.
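
To make the two options concrete, here is a rough sketch in Python 3
syntax.  Both functions are hypothetical stand-ins, not the Page
Templates API: render() imitates an engine whose internal conversion
only copes with ASCII, and render_encoded() imitates an engine fixed
to encode once at the output boundary.  The choice of UTF-8 as the
default is an assumption.

```python
def render(parts):
    """Stand-in for a template engine that only handles ASCII output.

    Hypothetical: the .encode("ascii") call plays the role of the
    implicit str() conversion inside the engine.
    """
    return b"".join(part.encode("ascii") for part in parts)


def render_encoded(parts, encoding="utf-8"):
    """Same idea, but encoding once, at the output boundary."""
    return "".join(parts).encode(encoding)


page = ["<h1>", "caf\xe9", "</h1>"]  # the e-acute entered far upstream

try:
    render(page)
except UnicodeEncodeError as exc:
    # The error surfaces here, not in the code that produced the text.
    print("render failed on:", ascii(exc.object))

print(render_encoded(page))
```

With the first engine every caller has to encode before handing text
over; with the second, none of them do.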

> > > I now know how python's unicode system works, but most Python
> > > developers really don't, in my experience.
> > 
> > That (i.e. understanding) seems to be the core problem.
> 
> Agreed. And the error messages happen on occasions that only
> increase the confusion of the developer. Often the developer is
> doing something else and suddenly these nasty unicode errors pop up,
> and the temptation is great to just find a way to paper over matters
> without any new insight.

Yes, programmers are like that, not just with Unicode.  It is the
kind of temptation that a good programmer knows to resist, however --
or at least knows to put on a stack of important things to come back
to later (once the fire-du-jour is extinguished) to fix it right,
rather than papering over with ever more brittle "solutions".

> The other main problem is that systems are not built for unicode,
> such as Zope 2.x.

Maybe it would have been better to stick with that rule, and disallow
Unicode anywhere in Zope 2.x?  You could still work with encoded text.

> I think in Zope 3 it will be much cleaner, as the framework takes
> care of the conversion to unicode and back in most cases, and
> developers within this framework can just deal with unicode most of
> the time.

Right.  That's why I'm fighting Unicode superstition.

> > Maybe it helps to explain that instead of thinking about 8-bit
> > strings vs. Unicode strings, it helps to think of THREE
> > categories.  These are:
> 
> [snip three categories]
> 
> That's how I think about it. In Python we have ascii, bytes and unicode.
> The confusion is that ascii and bytes are represented the same way. :)

Right.  It is an unfortunate effect of the way Python started out with
a single string type that was used for both text and bytes.  I hope
that in Python 3.0, this will be fixed by having all strings be
Unicode and introducing a new datatype for bytes.  (This is Java's
approach.)
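
The three categories can be shown side by side.  The sketch below uses
Python 3 syntax, where str is Unicode text and bytes is a separate
type; the variable names are only illustrative.

```python
raw = b"\xe9"        # bytes: no inherent meaning as text
ascii_only = b"abc"  # ASCII: same text under any ASCII-compatible encoding
text = "\xe9"        # Unicode text: e-acute, U+00E9

# The same character needs different bytes under different encodings.
print(raw.decode("latin-1") == text)  # Latin-1 maps byte 0xE9 to U+00E9
print(text.encode("utf-8"))           # UTF-8 uses two bytes for U+00E9

# ASCII bytes decode to the same text whichever of these encodings is used.
decoded = {ascii_only.decode(enc) for enc in ("ascii", "latin-1", "utf-8")}
print(decoded)

# Bytes and text are distinct types and do not compare equal.
print(b"abc" == "abc")
```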

> > > The bugs are nasty to track down and don't encourage developer
> > > enlightenment either - the response is "oh, one of these unicode
> > > weird errors again", without understanding what the right
> > > approach should be.
> > 
> > That's why I'd like to see some samples of these, rather than
> > blanket reports.
> 
> The problem is that these examples occur in large frameworks in
> transition from ascii to unicode. We had many of these issues when I
> decided we should follow the DOM standard with Silva, and represent
> DOM strings as unicode as opposed to 8 bit strings. Since before we
> didn't, our code fed latin-1 encoded strings right into the DOM, and
> plugging all the holes was a big operation. Since Page Templates at
> the time could not handle unicode strings, this compounded the
> problem on the output side as well. It was not possible to just
> convert everything at the end of the whole page template pipeline;
> instead all unicode strings needed to be converted before they
> entered the process.

I wonder if using 8-bit strings encoded as UTF8 would have made things
easier than using Unicode strings?

> > > Now in Zope 3 this unicode confusion may not take place as Zope
> > > 3 is unicode in the core, and input and output are frequently
> > > automatically translated. If confusion is indeed less, using
> > > ascii literals may be easy to explain. But if it would lead to
> > > anything like the Zope 2.x situation, where non-ascii user input
> > > can cause the system to complain at unexpected code paths, then
> > > let's please remain hard-line (even to the point of silliness)
> > > about unicode.
> > > 
> > > It's far easier to explain "use unicode everywhere" than to
> > > explain the other alternatives.
> > 
> > So I've noticed, and that leads to the (in my eyes) abomination of
> > u"..."  literals containing ASCII strings.  We need to stamp out
> > this misunderstanding!!!
> 
> As long as we don't get too many cases where people wonder why they
> can't put latin-1 in their string literals.

You should never put Latin-1 in 8-bit string literals *unless* your
code *including your framework* is monolingual.  This clearly excludes
Zope 3 *and all Zope 3 products* (even if a product is monolingual,
the framework isn't).  Latin-1 in Unicode string literals works if the
source file contains a "# -*- coding: Latin-1 -*-" cookie in line 1 or
2.
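
A short sketch of the cookie at work, in Python 3 syntax (where every
string literal is a Unicode literal).  The source text is held in a
bytes object to imitate a Latin-1 file on disk; the file name passed to
compile() is made up.

```python
# Source code as it would sit on disk: Latin-1 bytes, cookie on line 1.
source = b"# -*- coding: latin-1 -*-\ngreeting = 'caf\xe9'\n"

namespace = {}
exec(compile(source, "<latin1-example>", "exec"), namespace)

# The byte 0xE9 inside the literal was decoded as Latin-1, giving U+00E9.
print(namespace["greeting"] == "caf\u00e9")
```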

> If the Zope 3 framework doesn't let this happen (at least if it
> complains right away, not somewhere deep in a code path), then this
> should be reasonably simple to explain and developers will learn
> quickly enough.

Right.

> The question of abomination is less clear also if you start speaking
> about other languages. For instance, *most* Dutch strings could be
> contained in ascii literals, but not all (accented e occurs in a
> few places, as well as the 'trema'). Does this mean all literals
> in Dutch should be in unicode, or only those that contain these
> characters?

That's up to a particular project's coding guidelines.  The only rule
that Zope 3 imposes should be not to put non-ASCII in 8-bit string
literals.

> And what about German or French?  If you say all literals should be
> unicode, then these questions are avoided.

For German and French you'll be typing u"..." most of the time anyway,
so using it all the time is no big deal.  For English, they're noise;
for Dutch, they're noise most of the time (I've been typing Dutch
without accents and tremas for 25 years now :-).

--Guido van Rossum (home page: http://www.python.org/~guido/)