[Zope3-dev] Re: [Zope3-Users] PsycopgDA problem

Thu Mar 23 22:33:23 EST 2006

[moving to zope3-dev]

Dmitry Vasiliev wrote:
> Stuart Bishop wrote:
>> Dmitry Vasiliev wrote:
>>> Also I've fixed broken tests, added CHANGES.txt and removed "utf-8"
>>> encoding because it doesn't work for me. For now the only reliable way
>>> to set encoding is through the configuration form.
>>
>> What was broken with the Unicode support? This change is a regression
>> as the
>> client encoding is now stuck with whatever the default client encoding
>> happens to be set to on the backend and allows the situation where the DA
>> and the backend are attempting to talk totally different encodings.
> 
> It seems that 'SET client_encoding TO UNICODE' doesn't work as expected
> in our configuration (we use 'cp1251' encoded database for historical
> reasons) so we always got an UnicodeError with this option.

If your getting a UnicodeError, this indicates a bug at the Python level.
I've just created a fresh database using CP1251 and PostgreSQL 8.1. Using
psql I can successfully connect to it, set the client_encoding to UTF8 and
insert UTF8 character sequences in, which PostgreSQL stores as CP1251
sequences. I can also repeat this with Python:

14:10:51~/lp $ psql -l
               List of databases
           Name           |  Owner   | Encoding
--------------------------+----------+----------
 foodb                    | stub     | WIN1251

Python 2.4.2 (#2, Mar  5 2006, 00:03:25)
[GCC 4.0.3 20060212 (prerelease) (Ubuntu 4.0.2-9ubuntu1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from psycopg import connect
>>> con = connect('dbname=foodb')
>>> cur = con.cursor()
>>> cur.execute('set client_encoding to utf8')
>>> cur.execute('show client_encoding')
>>> cur.fetchone()[0]
'utf8'
>>> cur.execute('delete from foo')
>>> cur.execute(u"insert into foo values ('\N{COPYRIGHT SIGN}')".encode('utf8'))
>>> cur.execute('select * from foo')
>>> print '\n'.join(r[0] for r in cur.fetchall())
©
>>> cur.execute(u"insert into foo values ('\N{BIOHAZARD SIGN} not in
cp1251')".encode('utf8'))
>>> cur.execute("SELECT * FROM foo")
>>> print '\n'.join(r[0] for r in cur.fetchall())
©
 not in cp1251
>>>

PostgreSQL just issues a warning and silently drops characters it receives
that it cannot re-encode into the database encoding. Which is rather scary!

I think I know what might have been going on. The earlier patch neglected to
change getEncoding() to always return UTF-8. So the code in zope.app.rdb
that encodes the queries was encoding them to whatever getEncoding()
happened to return rather than UTF8.

Would you be happy with this if it works for you?

The alternative that meets my requirements would be to make the possible
encodings a Choice field of the encodings that recent PostgreSQL supports,
(which needs a mapping from the PostgreSQL naming to the Python naming). The
default would be UTF-8, and the connection would issue a 'SET
client_encoding' statement using the selected encoding rather than trusting
that whatever encoding has been selected in Z3 happens to be the default
client encoding. There would be a slight performance improvement with this
method in the case where a non-UTF8 database could receive data in its
native encoding, but I'd personally rather not have the added complexity.

>> If there is a reason you cannot communicate with your backend using UTF-8
>> after setting the client encoding to UTF-8, then a better approach for
>> the
>> DA would be to query the client encoding and use that which avoids the
>> possible mismatch (although this still means people are stuck using
>> whatever
>> the default encoding is on the backend).
> 
> MySQLDbDa already did this but some versions of MySQL returns wrong
> encoding name (see
> http://mail.zope.org/pipermail/zope3-users/2005-September/001033.html)
> so it was fixed to use getEncoding() method.
> 
> I think that input encoding name into the form not so big problem and
> moreover this solution works across different database versions.

Right. I'm only testing with PostgreSQL 8.1 here.

The getEncoding() approach isn't meaningful with database adapters that
accept Unicode strings which is becoming the norm. The idea that database
adapters will never be Unicode aware seems to have been embedded into Z3
quite deeply though :-( I guess we can allow getEncoding() to return None
for modern database drivers and make Z3 pass through queries as Unicode or
ASCII strings to driver in this case. I'm going to want to make the switch
to psycopg2 soon, and I think I'll need to sort this out in Z3 at that
point. I don't know yet if it will be better to maintain a single psycopgda
or if it will be better to split it into two products.

-- 
Stuart Bishop <stuart at stuartbishop.net>
http://www.stuartbishop.net/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 191 bytes
Desc: OpenPGP digital signature
Url : http://mail.zope.org/pipermail/zope3-dev/attachments/20060324/60cfbd71/signature.bin