[Zope3-dev] i18n, unicode, and the underline

Martijn Faassen faassen@vet.uu.nl
Mon, 14 Apr 2003 15:23:03 +0200


Guido van Rossum wrote:
> > Guido van Rossum wrote:
> > > So I don't see how an application could get away with using "..." 
> > > containing non-ASCII text at all -- it would fail as soon as the
> > > literal was being output.
> 
> [Martijn]
> > Note that in Zope 2 (at least until 2.6 which is a bit better) this
> > can lead to obscure unicode errors occurring in unexpected places,
> > dependent on user input and so on.
> 
> I've heard this before, and it seems to be the main reason why Zope 3
> has chosen an (IMO unnecessary) anal attitude towards Unicode.  But
> nobody has ever been able to show me what the problems were.  (I'm not
> doubting there were problems; I just need more details to understand
> what was the matter.)

I doubt the problem is really big with literals, as it's fairly easy to
make sure they're ascii only. It does exist with user input, which can
be anything, for instance latin-1. The problem occurs when users start
mixing (say) latin-1 with unicode. This seems to work in tests, until
a user enters some non-ascii character, and then code suddenly starts to
give unicode errors in locations that are not always easy to figure out.
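
A minimal sketch of this failure mode, written for Python 3, where the
implicit ascii decoding that Python 2 does when an 8-bit string meets
unicode has to be emulated by hand. The helper name py2_style_concat and
the sample strings are made up for illustration:

```python
def py2_style_concat(unicode_string, byte_string):
    # Emulates Python 2: the 8-bit string is implicitly decoded as ASCII
    # before it is joined to the unicode string.
    return unicode_string + byte_string.decode("ascii")


# ASCII-only input, as in most tests: works fine.
print(py2_style_concat("Author: ", b"Martijn"))

# latin-1 input with an accented character: only now does it fail.
user_input = "Ren\u00e9".encode("latin-1")
try:
    py2_style_concat("Author: ", user_input)
except UnicodeDecodeError as exc:
    print("unicode error:", exc.reason)
```

The same code path succeeds or fails depending purely on what the user
typed, which is why tests with ascii-only data don't catch it.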

This is similar to the use of one floating point number in an integer
calculation, except that it's worse, as the application can (but does not
necessarily, depending on input) raise exceptions instead of just
producing not-quite-right output.

> I'd like to see at least one report of an actual case, because there
> are a few different ways that Unicode can cause errors.

> One way to get Unicode problems is non-ASCII in 8-bit text that should
> have been converted to Unicode (which means that the encoding must be
> known).

This is what I just described. This hurts if you have a complicated application
where suddenly all input-holes need to be plugged. If you forget to convert
the input at one point, you will get an error later, but this may not
be easy to detect: for many inputs the error never occurs, for instance
if you write Dutch in latin-1, where most text is plain ascii.

> Another way to get Unicode problems is that Unicode somehow got mixed
> in with ASCII and the ASCII got promoted to Unicode, after which point
> the I/O library can't deal with it (e.g. you can't write a Unicode
> string to a file or socket).
> 
> Which was it?

Both. This one also occurred. Page Templates until Zope 2.6 were not capable
of dealing with unicode. This means that *all* the code that needs to display
unicode has to be plugged with .encode()s all over the place. Again, if you
forget one, it'll often work (as a str() would be applied to it
at some point), but sometimes it would give a unicode error. This happens
at the point of the str() within the page template engine, at which point it
is hard to figure out where the unicode was coming from.
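
A rough sketch of the output side, again in Python 3 with the Python 2
behaviour emulated explicitly. The render function is a hypothetical
stand-in for a template engine that can only join 8-bit strings; it is not
the actual Page Templates code:

```python
def py2_style_str(value):
    # Emulates Python 2's str() on a unicode object: implicit ASCII encode.
    return value.encode("ascii")


def render(fragments):
    # Stand-in for a template engine that only handles 8-bit strings.
    return b"".join(py2_style_str(fragment) for fragment in fragments)


# Unicode that happens to be ASCII-only slips through unnoticed.
print(render(["<p>", "hello", "</p>"]))

# One forgotten .encode() on non-ASCII text fails deep inside render(),
# far away from the code that produced the fragment.
try:
    render(["<p>", "caf\u00e9", "</p>"])
except UnicodeEncodeError as exc:
    print("unicode error:", exc.reason)
```

The traceback points into the rendering code, not at the place that forgot
to encode, which is what makes these hard to track down.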

> > I now know how python's unicode system works, but most Python
> > developers really don't, in my experience.
> 
> That (i.e. understanding) seems to be the core problem.

Agreed. And the errors show up at moments that only increase the
confusion of the developer. Often the developer is doing something else
and suddenly these nasty unicode errors pop up, and the temptation is great
to just find a way to paper over matters without gaining any new insight.

The other main problem is systems that are not built for unicode, such
as Zope 2.x. I think in Zope 3 it will be much cleaner, as the framework
takes care of the conversion to unicode and back in most cases, and developers
within this framework can just deal with unicode most of the time.
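
The general idea of converting at the edges can be sketched as follows.
This is not Zope 3's actual API; the function names and the choice of
latin-1 for input and utf-8 for output are assumptions made for the
example:

```python
def decode_input(raw, encoding="latin-1"):
    # Input boundary: bytes from the outside world become unicode.
    # The encoding has to be known; latin-1 is assumed here.
    return raw.decode(encoding)


def encode_output(text, encoding="utf-8"):
    # Output boundary: unicode becomes bytes exactly once, at the very end.
    return text.encode(encoding)


def greet(name):
    # Application code in between only ever sees unicode.
    return "Hallo, " + name + "!"


raw = "Ren\u00e9".encode("latin-1")  # what a browser might have sent
page = encode_output(greet(decode_input(raw)))
print(page)
```

With both conversions in one place each, application code like greet()
never needs to know which encoding the user's browser happened to use.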

>  Maybe it
> helps to explain that instead of thinking about 8-bit strings
> vs. Unicode strings, it helps to think of THREE categories.  These
> are:

[snip three categories]

That's how I think about it. In Python we have ascii, bytes and unicode.
The confusion is that ascii and bytes are represented the same way. :)

> > The bugs are nasty to track down and don't encourage developer
> > enlightenment either - the response is "oh, one of these unicode
> > weird errors again", without understanding what the right approach
> > should be.
> 
> That's why I'd like to see some samples of these, rather than blanket
> reports.

The problem is that these examples occur in large frameworks in transition
from ascii to unicode. We had many of these issues when I decided we should
follow the DOM standard with Silva, and represent DOM strings as unicode
as opposed to 8 bit strings. Since we hadn't done so before, our code fed latin-1
encoded strings right into the DOM, and plugging all the holes was a big
operation. Since Page Templates at the time could not handle unicode strings,
this compounded the problem on the output side as well. It was not possible
to just convert everything at the end of the whole page template pipeline;
instead all unicode strings needed to be converted before they entered the
process.

> > Now in Zope 3 this unicode confusion may not take place as Zope 3 is
> > unicode in the core, and input and output are frequently
> > automatically translated. If confusion is indeed less, using ascii
> > literals may be easy to explain. But if it would lead to anything
> > like the Zope 2.x situation, where non-ascii user input can cause
> > the system to complain at unexpected code paths, then let's please
> > remain hard-line (even to the point of silliness) about unicode.
> > 
> > It's far easier to explain "use unicode everywhere" than to explain
> > the other alternatives.
> 
> So I've noticed, and that leads to the (in my eyes) abomination of
> u"..."  literals containing ASCII strings.  We need to stamp out this
> misunderstanding!!!

As long as we don't get too many cases where people wonder why they can't
put latin-1 in their string literals. If the Zope 3 framework doesn't let
this happen (at least if it complains right away, not somewhere deep in a
code path), then this should be reasonably simple to explain and developers
will learn quickly enough.

The question of abomination is also less clear once you start talking about
other languages. For instance, *most* Dutch strings could be contained
in ascii literals, but not all (accented e occurs in a few places, as does
the 'trema'). Does this mean all literals in Dutch should be in unicode,
or only those that contain these characters? And what about German or French?
If you say all literals should be unicode, then these questions are avoided.
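
To make the Dutch case concrete, here is a small Python 3 sketch that
checks which of a handful of made-up interface labels would fit in an
ascii literal. The helper name fits_in_ascii and the labels are
illustrative only:

```python
def fits_in_ascii(text):
    # True if every character of the text can be encoded as ASCII.
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


labels = [
    "Opslaan",            # plain ASCII
    "Annuleren",          # plain ASCII
    "E\u00e9n bestand",   # accented e
    "Co\u00f6rdinaten",   # trema
]

for label in labels:
    print(label, fits_in_ascii(label))
```

Under an "ascii literals where possible" rule, the first two labels and the
last two would have to be written differently, even though they sit side by
side in the same user interface.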

Regards,

Martijn