unicode and ZCTextIndex, WAS:RE: [Zope] strange unicode behaviour

Dieter Maurer dieter@handshake.de
Fri, 25 Jul 2003 02:20:11 +0200


Giuseppe Bonelli wrote at 2003-7-24 16:21 +0200:
 > Thanks to all who responded to my original post, particularly to Toby
 > who pointed me in the right direction: I was mistakenly (and stupidly
 > ...) using:
 > 
 > <meta http-equiv="content-type"
 > content="text/html;charset=&dtml-encoding;">
 > instead of:
 > <dtml-call
 > "RESPONSE.setHeader('content-type','text/html;charset=utf-8')">
 > 
 > in my standard_html_header, so I was encoding on the browser, but not
 > over http !!!
 > 
 > This solved everything, but an issue remains:

I hit this same problem earlier.
I never understood why the meta "http-equiv=content-type" element did
not work; I only noticed that it did not work reliably.

Do you know why it does not work?

 > I started fiddling with encoding, when I wanted to full text index my
 > utf-8 encoded unicode content with ZCTextIndex and the lexicon gave me
 > the usual ordinal not in range decoding error when building the index.

I remember that ZCTextIndex (at least in its early versions) could not
handle Unicode (see the mailing list archives).
Andreas' TextIndexNG has been proposed as a fully Unicode-aware
alternative.

Be careful, though:

  All indexes, independent of type, are built upon BTrees.
  BTrees require that their keys are persistently ordered.
  This usually implies that all keys must have the same type.

  Mixing Unicode and non-Unicode keys can result in
  corrupted indexes (less likely) or implicit conversions
  (more likely) with potential "encoding errors".
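
  The point about key types can be illustrated with a minimal sketch.
  It uses a plain sorted list from the standard library as a stand-in
  for a BTree (it is not the BTrees package), and it is written for
  Python 3, where byte strings and text never compare at all. The
  helper names "to_text" and "insert_key" are hypothetical. The idea is
  simply to decode every key explicitly before it reaches the ordered
  structure, so that no implicit conversion is ever attempted.

```python
import bisect


def to_text(key, encoding="utf-8"):
    """Return key as text, decoding byte strings explicitly (assumed UTF-8)."""
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key


def insert_key(sorted_keys, key):
    """Insert a normalized key into a sorted list, keeping it ordered."""
    bisect.insort(sorted_keys, to_text(key))


# Mixed input: two text keys and one UTF-8 encoded byte string.
keys = []
for raw in ["zope", "caf\u00e9".encode("utf-8"), "index"]:
    insert_key(keys, raw)

# What an implicit ASCII conversion of the same byte string would do:
# it fails with the familiar "ordinal not in range(128)" error.
try:
    "caf\u00e9".encode("utf-8").decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False
```

  With explicit decoding, every key in "keys" is text and the list stays
  ordered; the implicit ASCII route fails on the first non-ASCII byte.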

 > ...
 > Specifically:
 > 1. Does the standard ZCTextIndex coming with Zope 2.6.1 support this ?

I do not know.

The CVS repository ("cvs.zope.org") could tell you what changes
were made to ZCTextIndex since the bug report.

 > 2. If yes, do I need to start Zope with a particular locale ?

For Unicode, no special locale should be necessary.
However, this depends on the splitter.
A splitter might decide to use locale information
even for Unicode strings.
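
As a rough illustration of a splitter that needs no locale, here is a
minimal sketch in Python 3. It is not the ZCTextIndex or TextIndexNG
splitter; "split_words" and "WORD_RE" are hypothetical names. It relies
on the fact that, for text patterns, the standard "re" module treats
"\w" as Unicode-aware by default, independent of the process locale.

```python
import re

# For str patterns in Python 3, \w matches Unicode word characters
# by default; no locale setting is consulted.
WORD_RE = re.compile(r"\w+")


def split_words(text):
    """Split text into lowercased words; byte strings are rejected."""
    if not isinstance(text, str):
        raise TypeError("split_words expects text (str), not bytes")
    return [word.lower() for word in WORD_RE.findall(text)]


words = split_words("Größe und Café, naïve Übung")
```

Rejecting byte strings up front keeps the decoding decision with the
caller, so the splitter never has to guess an encoding.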

 > 3. Regarding these issues, is the recently released TextIndexNG ver.2 a
 > better solution ?

Andreas is very confident that TextIndexNG handles Unicode very
well.


Dieter