[Grok-dev] Problem with character encoding (Solution for the record...)

Sebastian Ware sebastian at urbantalk.se
Wed Jul 9 05:41:05 EDT 2008


The problem was that the receiving server assumed the URL-encoded
non-ASCII characters were UTF-8 unless I explicitly specified
'charset=iso-8859-1' in the "Content-Type" header. Once I added the
charset, it all worked fine.
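
For the record, here is a minimal sketch of the fix. It is written for
Python 3, whereas the thread itself used Python 2's urllib.urlencode, and
the gateway URL and the field name "message" are made up for illustration.
It only builds the request; nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, for illustration only.
GATEWAY_URL = "http://sms.example.com/send"


def build_sms_request(message, encoding="iso-8859-1"):
    """Build (but do not send) a form POST whose body uses `encoding`."""
    # Percent-encode the form field using the target encoding, not UTF-8.
    body = urlencode({"message": message}, encoding=encoding).encode("ascii")
    # Tell the server which charset the percent-encoded bytes are in.
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=" + encoding,
    }
    return Request(GATEWAY_URL, data=body, headers=headers)


req = build_sms_request("å")
print(req.data)                                   # b'message=%E5'
print(build_sms_request("å", "utf-8").data)       # b'message=%C3%A5'
```

The two printed bodies show why the charset matters: the same character is
sent as one percent-encoded byte in ISO-8859-1 and as two in UTF-8, so the
server has to be told which one to expect.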

Best regards, Sebastian

On 9 Jul 2008, at 00:21, Luciano Ramalho wrote:

> On Tue, Jul 8, 2008 at 6:49 PM, Sebastian Ware  
> <sebastian at urbantalk.se> wrote:
>> Many thanks for your patience Luciano! I wish I was just tired, but
>> unfortunately it is the character encoding that confuses me :(
>
> You are very welcome, Sebastian!
>
>> I was expecting
>>
>>  u'å'.encode('iso-8859-1')
>>
>> to encode the unicode string to an 'iso-8859-1' encoded string, but as you
>> are pointing out, it returns a two byte encoding.
>
> No, it returns a one-byte encoding, which is represented by a hex
> character code when the Python console displays it:
>
>>>> c = u'å'.encode('iso-8859-1')
>>>> c
> '\xe5'
>>>> len(c)
> 1
>>>>
>
>> However, it is eventually
>> encoded properly by urllib.urlencode and allows me to (in this case) send an
>> SMS with non-ASCII characters.
>>
>> The spec I need to meet is:
>>
>> - perform an HTTP POST with an 'iso-8859-1' encoded string
>>
>> I can do it in the Python interpreter, but once I use a string stored in the
>> ZODB, non-ASCII characters go bonkers...
>
> I really don't see what the ZODB has to do with it.
>
> I think you are getting confused by the fact that Python actually has
> two string types today: str and unicode. You use the str.decode method
> to convert from a string in a particular encoding (such as iso8859-1 or
> utf-8) to unicode, and unicode.encode to do the opposite: convert a
> unicode object to a str object, using a certain encoding to do it.
>
> Take a look... c is a str containing the ISO-8859-1 char for å (one byte):
>
>>>> c = u'å'.encode('iso-8859-1')
>>>> c
> '\xe5'
>>>> len(c)
> 1
>
> Now I convert it to a unicode object, containing the same char (here,
> len does not tell me the number of bytes, but the number of characters
> in the unicode object, which is really what matters to us most of the
> time):
>
>>>> u = c.decode('iso-8859-1')
>>>> u
> u'\xe5'
>>>> len(u)
> 1
>
> If we convert the same unicode object back to str, but using the UTF-8
> encoding, the result is a two-byte str:
>
>>>> t = u.encode('utf-8')
>>>> t
> '\xc3\xa5'
>>>> len(t)
> 2
>
> Hth!
>
> Cheers,
>
> Luciano
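
For anyone reading this with Python 3: the same walkthrough can be sketched
as below, assuming Python 3 naming, where the Python 2 `unicode` type is
`str` and the Python 2 `str` type is `bytes`. The variable names are only
illustrative.

```python
text = "å"                            # a str: one character

latin1 = text.encode("iso-8859-1")    # bytes in ISO-8859-1
utf8 = text.encode("utf-8")           # bytes in UTF-8

print(latin1, len(latin1))            # b'\xe5' 1
print(utf8, len(utf8))                # b'\xc3\xa5' 2

# Decoding with the matching codec gives back the original character.
assert latin1.decode("iso-8859-1") == text
assert utf8.decode("utf-8") == text

# Decoding UTF-8 bytes as ISO-8859-1 yields two wrong characters, which is
# the kind of garbling a charset mismatch between client and server produces.
garbled = utf8.decode("iso-8859-1")
print(garbled, len(garbled))          # Ã¥ 2
```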


