[Zope3-Users] Re: Unicode for Stupid Americans (like me)?

Philipp von Weitershausen philipp at weitershausen.de
Wed Feb 28 11:38:43 EST 2007


Jeff Shell wrote:
> I continue to feel like an idiot in the face of Unicode. I finally
> understand what a unicode 'string' really is, and what encode and
> decode mean (they were previously interchangeable in my mind). But I
> don't know the best practices.
> 
> My desire is to:
> 
> - Not have any encode / decode errors. 'ascii codec doesn't recognize
> character ... at position ...'. I don't want to keep on bullying
> through whenever this pops up.

You can't simply do str(some_unicode) or unicode(some_str) unless you
know for certain that you're only dealing with the ASCII subset in both
cases. Use explicit encodings to convert.

Now, the trick is obviously to know the encoding. A 'str' object is
worth squat if you don't know the encoding that goes along with it. In
other words, the pair (some_str, encoding) is isomorphic to a unicode
object.
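To make the (some_str, encoding) pairing concrete, here is a minimal
sketch. It is written for Python 3, where `bytes` plays the role of the
'str' discussed above and `str` plays the role of 'unicode'; the sample
word is only an illustration.

```python
# Python 3 naming: `bytes` ~ the old 'str', `str` ~ the old 'unicode'.
raw = b"caf\xe9"                 # bytes as a Latin-1 client might send them

# bytes + encoding -> text
text = raw.decode("latin-1")

# text + encoding -> bytes (a different byte sequence for the same text)
utf8_bytes = text.encode("utf-8")

# The same bytes with the wrong encoding raise instead of guessing.
decode_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    decode_failed = True
```

The bytes alone do not say which of these interpretations is right; only
the encoding that travels with them does.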

> - Not turn customer input into garbage. It may render to the public
> site fine, but sometimes in the admin skin's text areas, things turn
> funky. I don't know if there's something I need to do at form-handling
> time, or at rendering time, or what... I did a test based on a
> document by Sam Ruby, and guess that I'm often getting Latin-1 from
> our customers, which doesn't map to UTF-8 (the diacritic marks go
> haywire).
> 
>  - HOW do I know what a browser has sent me? There doesn't seem to be
> a real way of handling this. Do I guess?

That's sorta what zope.publisher does. Actually, it figures that if the
browser sends an Accept-Charset header, the stuff that it's sending to
us will be encoded in one of those encodings, so it tries the ones in
Accept-Charset until it gets lucky. It falls back to UTF-8.

This seems to work. But yeah, it's relying on implementation details of 
the browser and it's weird.
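The "try each charset until one works" idea can be sketched roughly as
follows. This is not zope.publisher's actual code; `decode_input` is a
hypothetical helper, it is written for Python 3 (`bytes` in, `str` out),
and it ignores any q-values in the header for simplicity.

```python
def decode_input(raw, accept_charset):
    """Decode raw bytes using the charsets named in an Accept-Charset
    header value, in the order listed, then UTF-8 as a fallback.

    Returns a (text, charset_used) tuple.
    """
    candidates = []
    for item in accept_charset.split(","):
        name = item.split(";")[0].strip()   # drop any ";q=..." parameter
        if name and name != "*":
            candidates.append(name)
    candidates.append("utf-8")              # fallback

    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python doesn't know.
            continue
    raise ValueError("input is not valid in any candidate charset")
```

Note that a charset like latin-1 will decode any byte sequence without
complaint, so if it is tried first it always "wins", whether or not it
was the right guess. That is part of why this approach is weird.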

> - Know without a doubt when to encode, and when to decode. I guess the
> "proper" thing to do is to store everything as unicode, and to decode
> to unicode as early as possible when input is coming in.

Absolutely correct.

> But again,
> how do I know when to decode from latin-1 and when to decode from
> UTF-8? When or why should I encode to one or the other at response
> time? Should I worry at all?

If you're using Zope, you don't have to encode outgoing text at all,
unless you're setting a non-text content-type on the outgoing response.
If the content-type is text/*, you can just return unicode from your
browser view and zope.publisher will use the best encoding that the
browser prefers (from Accept-Charset). "Best" meaning that if the
browser accepts latin-1,utf-8 and your page contains Korean text, it'll
use utf-8, not latin-1. utf-8 is always the fallback anyway, so
encoding can never fail.
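As a rough illustration of what "best" means here, the selection could
look something like the sketch below. Again, this is not the real
zope.publisher implementation; `choose_output_charset` is a hypothetical
name, and the code is Python 3 (`str` is the unicode type).

```python
def choose_output_charset(text, accepted):
    """Return the first charset in `accepted` that can represent `text`,
    or "utf-8" if none of them can."""
    for charset in accepted:
        try:
            text.encode(charset)
            return charset
        except (UnicodeEncodeError, LookupError):
            # Can't represent the text, or unknown charset name.
            continue
    return "utf-8"   # can encode any Unicode text


korean = "\ud55c\uad6d\uc5b4"    # Korean text, written with escapes
picked_for_korean = choose_output_charset(korean, ["latin-1", "utf-8"])
picked_for_french = choose_output_charset("caf\u00e9", ["latin-1", "utf-8"])
```

With these inputs, the Korean text skips latin-1 and lands on utf-8,
while the French word stays in latin-1 because latin-1 can represent it.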

You can, of course, encode the text yourself in the browser view. You
can pick pretty much any encoding you like; all you have to do is tell
the browser about it in the response header (Content-Type:
foo/bar;charset=your-encoding).
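In framework-neutral terms, encoding yourself comes down to keeping the
body and the header in agreement. A minimal sketch, assuming Python 3
and a hypothetical helper `encode_response` that returns plain header
and body values rather than using any Zope API:

```python
def encode_response(text, charset="utf-8", media_type="text/html"):
    """Encode `text` and build headers that announce the same charset."""
    body = text.encode(charset)
    headers = {
        "Content-Type": "%s;charset=%s" % (media_type, charset),
        "Content-Length": str(len(body)),   # length in bytes, not characters
    }
    return headers, body


headers, body = encode_response("caf\u00e9", charset="iso-8859-1")
```

The important part is that the charset named in Content-Type is the one
actually passed to encode(); a mismatch is what produces garbled
diacritics on the receiving end.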

> If there are any documents, web pages, Zope 3 book chapters, and past
> messages that I may have missed or need to look at in more detail,
> please let me know. I've had a hard time sifting through all of the
> information, and I apologize if I've missed something written by
> anyone here.

I'm wondering if I make this clear enough in my book. It's always hard
to tell by myself since these things seem obvious to me. If you have
any constructive feedback regarding this, I'll be more than happy to
hear it and consequently improve the book for you "Stupid Americans" :).

HTH

-- 
http://worldcookery.com -- Professional Zope documentation and training
Next Zope 3 training at Camp5: http://trizpug.org/boot-camp/camp5


