[Zope3-dev] i18n, unicode, and the underline

Martijn Faassen faassen@vet.uu.nl
Tue, 15 Apr 2003 12:51:54 +0200


Guido van Rossum wrote:
> > I'm not suggesting inherent problems with Unicode or Python's
> > implementation of it at all.
> 
> I'm glad you aren't.  I got feedback on Unicode from Jim that strongly
> suggested *he* thought we'd done it all wrong, and I want to nip this
> in the bud if I can.

He was frustrated by the automatic promotion of 8-bit strings
to unicode, which I understand. I do wish there were a better way to turn
this behavior off on a per-module basis than going to site.py. But considering
the tradeoffs, the current behavior seems to be the best compromise.
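The coercion in question happens whenever an 8-bit string and a unicode string meet; Python 2 then implicitly decodes the 8-bit string with the site-wide default encoding (ASCII unless changed in site.py). A minimal sketch of the explicit alternative, which never relies on that default (Python 3 syntax for clarity; the byte values are a hypothetical example):

```python
# Explicitly decode bytes exactly once, at the boundary where they enter,
# instead of letting an implicit promotion pick the encoding for you.
data = b"caf\xc3\xa9"          # raw bytes, here UTF-8 for "café"
text = data.decode("utf-8")    # explicit decode; no default encoding involved
assert text == "caf\u00e9"     # now unambiguously text, not bytes
```

With this style, the choice of encoding is visible in the code rather than hidden in site configuration.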

[snip]
> It's not clear to me how never auto-converting would make your life
> easier.

You'd get a Unicode error in the place where you made the mistake, instead
of in some later section of the code. If you care that 8-bit strings are
bytes (which Jim does when he writes networking code), then you want this
behavior, for instance.
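This is essentially the behavior Python eventually adopted: with no auto-conversion, mixing bytes and text fails at the exact point of the mix. A small illustration (Python 3 semantics; the header/length names are made up for the example):

```python
# Without auto-conversion, combining bytes with text raises immediately,
# at the line containing the mistake, rather than somewhere downstream.
header = b"Content-Length: "
length = "42"                   # text, where bytes were intended
caught = False
try:
    line = header + length      # the error surfaces right here
except TypeError:
    caught = True
assert caught
```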

That said, auto-converting can also make one's life easier, especially in the
transition environment we're in.

> > Would it hurt to introduce a new datatype for bytes in Python 2.x? I
> > think that could increase the expressiveness of code that deals with
> > bytes and would ease the transition to 3.0 as well. Would it cause
> > backwards compatibility issues?
> 
> This should be done, but there's no time for Python 2.3.  I'd
> appreciate help in writing a PEP.  The new bytes datatype would be
> entirely separate from strings; there'd have to be a new "super binary
> mode" for files to return bytes instead of strings from read().

I'll give it some thought.

> > [snip]
> > > I wonder if using 8-bit strings encoded as UTF8 would have made things
> > > easier than using Unicode strings?
> > 
> > Possibly. It wouldn't have been according to the DOM standard
> > though. But of course in hindsight I would've cared less about
> > that. :)
> 
> I thought that the DOM only required Unicode support and didn't spell
> out how you did it.  Why wouldn't UTF8 be good enough?

From the DOM standard (Core, level 2):

Applications must encode DOMString using UTF-16 (defined in [Unicode] and
Amendment 1 of [ISO/IEC 10646]). The UTF-16 encoding was chosen because of its
widespread industry practice. Note that for both HTML and XML, the document
character set (and therefore the notation of numeric character references) is
based on UCS [ISO-10646]. A single numeric character reference in a source
document may therefore in some cases correspond to two 16-bit units in a
DOMString (a high surrogate and a low surrogate).

One can argue with this, of course, but that's what the spec says. :)
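The surrogate-pair point is easy to check directly: a character above U+FFFF, written as a single numeric character reference in a document, occupies two 16-bit units when encoded as UTF-16. A quick demonstration (Python 3 syntax, using U+1D11E MUSICAL SYMBOL G CLEF as the example character):

```python
# A character outside the Basic Multilingual Plane becomes a surrogate
# pair in UTF-16: one high surrogate followed by one low surrogate.
clef = "\U0001D11E"
units = clef.encode("utf-16-be")
assert len(units) == 4                  # two 16-bit code units, four bytes
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")
assert 0xD800 <= high <= 0xDBFF         # high surrogate range
assert 0xDC00 <= low <= 0xDFFF          # low surrogate range
```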

> > The problem of course is that some people complaining likely will
> > have no clue about what's ascii and what's latin-1. :)
> 
> There's no hope for them.

No, there's hope; they simply need to be educated. I wasn't fully aware of
the details of this last year; now I am, so it's not rocket science. :)

Regards,

Martijn