[Zope-Coders] Analysis: BTrees and Unicode and Python

Andreas Jung <andreas@zope.com>
Fri, 19 Oct 2001 12:08:26 -0400


----- Original Message -----
From: "Guido van Rossum" <guido@python.org>
To: "Andreas Jung" <andreas@zope.com>
Cc: "Jim Fulton" <Jim@zope.com>; <zope-coders@zope.org>
Sent: Friday, October 19, 2001 11:52
Subject: Re: [Zope-Coders] Analysis: BTrees and Unicode and Python



> > - one of these earlier comparisons checks a Python string (containing
> >   an accented character) against a unicode string and raises a
> >   unicode exception (ASCII decoding error: ordinal not in range(128)).
> >   I assume this is because the default encoding is ASCII.
>
> Note that this was a conscious design decision.  Not all the world
> uses Latin-1, and many real-world programs and data use different
> interpretations of 8-bit characters with the high bit set.  Assuming
> Latin-1 when comparing to Unicode would be wrong.

I assume the exception is raised before the PyUnicode_Compare function
is called. Silently ignoring this error condition would not be a
solution either, so I agree that Python's behaviour is reasonable :)
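
To make the failure concrete, here is a minimal sketch of what a strict
ASCII decode does with a byte that has the high bit set. The helper name
strict_ascii_decode is hypothetical and not part of the Python C API; it
only illustrates the "ordinal not in range(128)" check, under the
assumption that the 8-bit string is decoded with the default encoding
before it is compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API: decode a byte string
 * under a strict ASCII default encoding.  Returns 0 on success, or -1
 * after storing the offset of the first offending byte in *bad_pos. */
static int
strict_ascii_decode(const unsigned char *s, size_t len,
                    unsigned int *out, size_t *bad_pos)
{
    size_t i;

    assert(s != NULL && out != NULL && bad_pos != NULL);
    for (i = 0; i < len; i++) {
        if (s[i] >= 128) {      /* ordinal not in range(128) */
            *bad_pos = i;
            return -1;
        }
        out[i] = s[i];          /* ASCII maps 1:1 onto code points */
    }
    return 0;
}
```

A plain ASCII key decodes fine, while a key such as "caf\xe9" fails at
offset 3, because nothing says whether 0xE9 is meant as Latin-1 or as
some other 8-bit encoding.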

>
> > - there is no check in the BTree code for an exception after
> >   PyObject_Compare(), so this error never got cleared
>
> This should be fixed before proceeding.

Yup!
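
A rough sketch of the kind of check that is missing, written as
self-contained C rather than against the real BTree sources. All names
here (compare_error, fallible_compare, find_key) are hypothetical
stand-ins. In the real code the compare would be PyObject_Compare() and
the test for a pending error would be PyErr_Occurred(), since the return
value of the compare alone cannot signal a failure.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a pending Python exception. */
static int compare_error = 0;

/* Return 1 if the string contains a byte with the high bit set. */
static int
has_high_bit(const char *s)
{
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p; p++)
        if (*p >= 128)
            return 1;
    return 0;
}

/* Three-way compare that can fail.  Keys with a non-ASCII byte are
 * treated as not comparable, mimicking the ASCII decoding error. */
static int
fallible_compare(const char *a, const char *b)
{
    int c;

    assert(a != NULL && b != NULL);
    if (has_high_bit(a) || has_high_bit(b)) {
        compare_error = 1;
        return -1;              /* looks exactly like "a < b" */
    }
    c = strcmp(a, b);
    return (c < 0) ? -1 : (c > 0) ? 1 : 0;
}

/* Linear search over sorted keys.  Returns the index of key, -1 if it
 * is absent, or -2 if a comparison failed.  The error flag is checked
 * after every compare, before the result is used. */
static int
find_key(const char **keys, int n, const char *key)
{
    int i, c;

    for (i = 0; i < n; i++) {
        c = fallible_compare(key, keys[i]);
        if (compare_error)
            return -2;          /* propagate; the caller handles the error */
        if (c == 0)
            return i;
        if (c < 0)
            return -1;          /* passed the slot where key would be */
    }
    return -1;
}
```

Without the compare_error test, the -1 from a failed compare would be
taken as "key is smaller" and the error would stay pending until some
unrelated later call trips over it.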

>
> > - when trying to compare two identical unicode strings, Python
> >   calls default_3way_compare() and runs into the following code:
> >
> >
> > static int
> > default_3way_compare(PyObject *v, PyObject *w)
> > {
> >     int c;
> >     char *vname, *wname;
> >
> >     if (v->ob_type == w->ob_type) {
> >         /* When comparing these pointers, they must be cast to
> >          * integer types (i.e. Py_uintptr_t, our spelling of C9X's
> >          * uintptr_t).  ANSI specifies that pointer compares other
> >          * than == and != to non-related structures are undefined.
> >          */
> >         Py_uintptr_t vv = (Py_uintptr_t)v;
> >         Py_uintptr_t ww = (Py_uintptr_t)w;
> >         puts("\t\t\tdefcmp 1");
> >         return (vv < ww) ? -1 : (vv > ww) ? 1 : 0;
> >     }
> >
> >   This code returns -1 for the two identical unicode strings.
> >
> > I am not sure whether this code is able to compare two unicode strings.
> > On the other hand, it is still strange that the unittest works when the
> > same unicode string in the unittest's list of test data is replaced
> > with self.s, as described earlier.
> >
> > Any ideas about that ?
>
> It is definitely a bug if comparison of two unicode strings ends up
> calling default_3way_compare()!
>
> This normally doesn't happen though -- the Unicode object's comparison
> code is generally called.
>
> I'd like to see what's on the stack when default_3way_compare is
> called with two Unicode objects.

How can I determine that ?
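
For reference, a small self-contained sketch of why the same-type branch
quoted above cannot return 0 for two distinct objects, even when their
contents are equal. The toy_object type and both function names are
hypothetical and only for illustration; the address comparison mirrors
the quoted code, using C99's uintptr_t in place of Py_uintptr_t.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical miniature object, only for this sketch. */
typedef struct {
    const char *text;
} toy_object;

/* Same idea as the same-type branch quoted above: order by address. */
static int
address_compare(const toy_object *v, const toy_object *w)
{
    uintptr_t vv = (uintptr_t)v;
    uintptr_t ww = (uintptr_t)w;

    return (vv < ww) ? -1 : (vv > ww) ? 1 : 0;
}

/* What a type-specific comparison does instead: order by content. */
static int
content_compare(const toy_object *v, const toy_object *w)
{
    int c;

    assert(v != NULL && w != NULL);
    c = strcmp(v->text, w->text);
    return (c < 0) ? -1 : (c > 0) ? 1 : 0;
}
```

Two separate objects holding "same" are equal under content_compare()
but never under address_compare(), whereas an object compared with
itself yields 0 in both. That might be why the unittest passes when the
very same object (self.s) is used on both sides, although I have not
verified this.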

> Which Python version is this?  2.1 or 2.1.1?

2.1.1

Andreas