charlie at begeistert.org
Sun Jan 18 09:49:05 EST 2009
On 29.12.2008 at 15:01, Charlie Clark wrote:
>> The site should deliver all pages containing forms (if possible even
>> all pages) with a single charset, let's call it the "site charset".
>> Then it uses this same charset to interpret form data.
> While I understand this, I'm a bit at a loss as to why this is
> happening. I'm using forms based on CMFDefault's formlib
> implementation. Charsets are set for the site and zpublisher but
> something else is probably missing.
Delving deeper into this I think I understand things a little better.
The accept-charset attribute on a form tag requires the browser to
encode any form data in the specified encoding. Ideally this would make
additional negotiation unnecessary, but this value isn't passed to the
server as HTTP_ACCEPT_CHARSET, which is where the fun starts. As
has been noted previously
(http://mail.zope.org/pipermail/zope3-dev/2004-June/011483.html),
browsers don't all behave themselves when setting this header: IE
6 and 7 and Safari set an empty header whereas Opera and Firefox usually
set something like "iso-8859-1, utf-8, utf-16, *;q=0.1".
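To make the header format concrete, here is a minimal sketch of how such a value breaks down into charsets and quality values. It is not the Zope implementation: parse_accept_charset and preferred_charsets are hypothetical helper names, the code is written for Python 3, and keeping header order for entries of equal quality is an assumption.

```python
def parse_accept_charset(header):
    """Return (charset, quality) pairs in header order; q defaults to 1.0."""
    result = []
    for item in header.split(","):
        item = item.strip().lower()
        if not item:
            continue
        name, _, params = item.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                continue  # skip entries with a malformed quality value
        result.append((name.strip(), quality))
    return result


def preferred_charsets(header):
    """Order charsets by descending quality, keeping header order for ties."""
    pairs = parse_accept_charset(header)
    # sorted() is stable, so entries of equal quality stay in header order.
    ordered = sorted(pairs, key=lambda pair: -pair[1])
    return [name for name, quality in ordered if quality > 0]


firefox_like = preferred_charsets("iso-8859-1, utf-8, utf-16, *;q=0.1")
empty_header = preferred_charsets("")
```

With the Firefox-style header the three named charsets all carry the default quality of 1.0, so under this tie-breaking assumption iso-8859-1 ends up first simply because it is listed first. The empty header yields an empty list, leaving the choice entirely to whatever default the server applies.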
getPreferredCharsets() will return 'iso-8859-1' where
HTTP_ACCEPT_CHARSET is empty, which will cause problems if the
browser is actually using UTF-8. But the way the CMF uses
getPreferredCharsets() isn't right either:

    """ Get charset preferred by the browser.
    """
    envadapter = IUserPreferredCharsets(request)
    charsets = envadapter.getPreferredCharsets() or ['utf-8']

This will always be iso-8859-1 for Opera and Firefox because all the
named charsets have the same quality, again even if UTF-8 encoding is
specified. I haven't been able to track down where the decoding of form
data occurs for Zope 2 stuff but I can identify the problem in

    def _decode(self, text):
        """Try to decode the text using one of the available charsets."""
        if self.charsets is None:
            envadapter = IUserPreferredCharsets(self)
            self.charsets = envadapter.getPreferredCharsets() or ['utf-8']
        for charset in self.charsets:
            try:
                text = unicode(text, charset)
                break
            except UnicodeError:
                pass
        return text

Here the naive assumption is that if we can decode from a charset
without an error then we have the correct charset. Sometimes this goes
unnoticed, but characters like U+2013 and U+2014 (en-dash and em-dash)
will raise errors, as those code points are not in the Latin-1 charset,
which has its own equivalents.
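The failure mode can be reproduced outside Zope. The following is a minimal sketch in Python 3 (the code above is Python 2), assuming a browser that submitted UTF-8 while the server tries iso-8859-1 first; decode_first_success is a hypothetical stand-in for the loop in _decode(), not the real method.

```python
def decode_first_success(data, charsets):
    """Return (text, charset) for the first charset that decodes without error."""
    for charset in charsets:
        try:
            return data.decode(charset), charset
        except UnicodeError:
            continue
    return data, None


text = "2008\u20132009"       # contains U+2013, an en-dash
raw = text.encode("utf-8")    # what a browser submitting UTF-8 would send

# iso-8859-1 maps every byte to a character, so decoding never fails
# and the loop stops at the wrong charset.
decoded, used = decode_first_success(raw, ["iso-8859-1", "utf-8"])

# Trying utf-8 first recovers the original text.
fixed, fixed_charset = decode_first_success(raw, ["utf-8", "iso-8859-1"])

# The correctly decoded text cannot be written back out as iso-8859-1.
try:
    text.encode("iso-8859-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```

In this sketch the wrong decode succeeds silently and produces three junk characters in place of the dash. An exception only appears when text containing U+2013 is encoded as iso-8859-1, which is one reason the underlying mismatch can go unnoticed for a long time.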
I would suggest that we work towards enforcing UTF-8 where possible
but at the very least add the accept-charset attribute to forms and
use the portal's default_charset for this.
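As a rough illustration of that suggestion, here is a sketch in Python 3 of emitting the attribute and then decoding with the same charset. form_open_tag and decode_form_value are hypothetical helpers, not CMF or Zope APIs, and site_charset stands in for the portal's default_charset.

```python
from html import escape


def form_open_tag(action, site_charset):
    """Render an opening form tag that names the charset the site expects."""
    return '<form action="%s" method="post" accept-charset="%s">' % (
        escape(action, quote=True),
        escape(site_charset, quote=True),
    )


def decode_form_value(raw, site_charset):
    """Decode submitted bytes with the site charset only, without guessing."""
    # "strict" raises UnicodeDecodeError instead of silently mis-decoding.
    return raw.decode(site_charset, "strict")


tag = form_open_tag("/edit", "utf-8")
value = decode_form_value("caf\u00e9".encode("utf-8"), "utf-8")
```

The point of decoding strictly with one charset is that a mismatch shows up as an error at the point of submission rather than as corrupted text later on.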
I'd very much appreciate your comments on this.