[Zope3-dev] i18n, unicode, and the underline

Barry Warsaw barry@python.org
14 Apr 2003 10:39:12 -0400


On Mon, 2003-04-14 at 09:23, Martijn Faassen wrote:

> I doubt the problem is really big with literals, as it's fairly easy to
> make sure they're ascii only. It does exist with user input, which can
> be anything, for instance latin-1. The problem occurs when users start
> mixing (say) latin-1 with unicode. This seems to work in tests, until suddenly
> a user enters some non-ascii character, and then suddenly code starts to
> give unicode errors in locations that are not always easy to figure out.

This jibes with my experience in internationalizing Mailman.  The
primary leakage of encoded strings for me was gettext, which by default
can return encoded 8-bit strings containing non-ascii (moral: always use
.ugettext()).  But user input in web forms and email also caused
leakage.  As an aside: I think Mailman 2.1(.2) papers over most of the
problems but it needs a real audit to eradicate encoded non-ascii 8-bit
strings.

Martijn's right though, the problem usually isn't literals.  But encoded
+ Unicode mixing can happen in surprising places, such as string
interpolation and other places that Guido described.
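
A minimal sketch of that failure mode, written for Python 3, where
mixing bytes and text is refused outright.  To imitate the older
behaviour, the hypothetical helper py2_style_concat() decodes the 8-bit
string as ASCII before joining it, on the assumption that this is what
the implicit coercion amounts to.  The names and sample inputs are
illustrative only.

```python
def py2_style_concat(text, raw):
    """Join text with an 8-bit string, decoding the latter as ASCII.

    Hypothetical helper: it imitates the implicit coercion that happens
    when an encoded string is mixed with a Unicode string.
    """
    return text + raw.decode("ascii")


ascii_input = b"Amsterdam"               # what the test suite feeds in
latin1_input = "Zoë".encode("latin-1")   # what a real user eventually types

# Pure-ASCII input goes through, so the tests pass.
greeting = py2_style_concat("Hello, ", ascii_input)

# The first non-ASCII byte blows up, at the point of mixing rather than
# at the point where the input was accepted.
try:
    py2_style_concat("Hello, ", latin1_input)
    failed = False
except UnicodeDecodeError:
    failed = True
```

The same code path succeeds or fails depending only on the data, which
is why it tends to slip through testing.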

Education is the key.  I know it took me a really long time to get all
the concepts straight in my head -- if I even have by now <wink>.  I'd
like to see a howto or something that can help the next crop of i18n'ing
Pythonistas.  I'd even help write one if I can free up some time.

> This is what I just described. This hurts if you have a complicated application
> where suddenly all input-holes need to be plugged. If you forget to convert
> the input at one point, you will get an error later, but this may not
> be so easy to detect, as for many inputs the error does not occur if you
> for instance write Dutch in latin-1.

This is definitely the most painful part.  More often than not, the
error is reported far away, in both time and code, from where the
unconverted string actually entered the system.
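
One way to pull the report closer to the cause is to decode at every
input boundary and name the source when decoding fails.  The sketch
below is written for Python 3; decode_input() and the source labels are
hypothetical, not anything taken from Mailman or Zope.

```python
def decode_input(raw, encoding, source):
    """Decode bytes at the input boundary, naming the source on failure.

    Hypothetical helper: text that is already decoded passes through
    unchanged, so the function is safe to call more than once.
    """
    if isinstance(raw, str):
        return raw
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        raise ValueError(
            "undecodable input from %s: %s" % (source, exc)
        ) from exc


# Correct encoding declared: the rest of the program only sees text.
name = decode_input("Zoë".encode("latin-1"), "latin-1", "web form field 'name'")
message = "Welcome, %s" % name

# Wrong encoding declared: the failure is raised right here, and the
# message says which input hole it came from.
try:
    decode_input("Zoë".encode("latin-1"), "utf-8", "email header")
    caught = None
except ValueError as err:
    caught = str(err)
```

This does not remove the need to plug every input hole, but a missed
conversion then fails at the hole itself instead of somewhere
downstream.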

-Barry