[Zope-dev] SearchIndex Splitter lowercase indexes?

Michel Pelletier michel@digicool.com
Thu, 24 May 2001 21:55:56 -0700 (PDT)


On Thu, 24 May 2001, Christian Robottom Reis wrote:

> On Thu, 24 May 2001, Michel Pelletier wrote:
>
> > This is a very common indexing strategy to save space and make searches
> > more relevant.  Otherwise 'Dog' and 'dog' would return two completely
> > different result sets.
>
> Fine. However:
>
> >>> s.indexes('Foo')
> []
>
> Is _this_ supposed to happen, too?

Yes.  The splitter was applied to the document before it was indexed,
so both 'Foo' and 'foo' became 'foo' and there is no 'Foo'.  The index
itself is technically not case insensitive; it's case flattened, which
makes the query interface case insensitive.
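
A rough sketch of the idea (not the actual Splitter code, just a
hypothetical flatten() helper to show what "case flattened" means):

# the document and the query both go through the same lowercasing
# word-extraction step, so 'Dog', 'DOG' and 'dog' all end up as the
# single term 'dog'
def flatten(text):
    return [w.lower() for w in text.split()]

flatten("Foo foo")   # ['foo', 'foo'] -- the index never stores 'Foo',
                     # which is why s.indexes('Foo') comes back empty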

> Ah, I guess so. It's the problem with
> using this outside of Zope. :-)

No, you just didn't apply the splitter before you queried the index.

results = []
# run the query text through the same Splitter used at indexing time,
# then look up each flattened word and collect the hits
for word in Splitter("search for these words or foo"):
  results = results + s.indexes(word)

> Uhhh, no, it _is_ implemented. It just didn't work like I'd expect :-)
>
> >>> index.positions(1,['crazy'])
> [2]
> >>> index.positions(1,'crazy')
> []
> >>> index.positions(1,['Crazy'])
> []

I see; yes, it must be a sequence, and you must also apply the splitter
to your input before querying an index.
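
In other words, for the three calls from your session:

index.positions(1, ['crazy'])   # a sequence of flattened words -> [2]
index.positions(1, 'crazy')     # a bare string                 -> []
index.positions(1, ['Crazy'])   # a sequence, but not flattened -> []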

> So it does look lowercase words up. Of course, this is an artifact of the
> following point you make:
>
> > you want to look up things in a text index, use the same splitter to munge
> > the content before querying the index, otherwise, you may end up not
> > finding what you're looking for.
>
> This makes sense:
>
> >>> s = Splitter("Crazy")
> >>> index.positions(1,s)
> [2]
>
> Ahhm. Okay. Will update my documentation with this important point.

Ah, I see you came to the answer yourself.  Yes, this is an important
point, especially for languages like Japanese, where the splitter *must*
be applied to extract the words from their context.

> > In other words, it's not an easy problem!  There is going to be an
> > unimaginable culture clash when asian and other non-romance languages
> > catch up to the volume of romance language content on the web.
>
> Fascinating points on i18n and l10n of the indexing mechanism. Makes me
> wonder how far the current implementation will go before having to be
> rewritten, and if the world will survive east-meets-the-west of computing
> text.

Digital Garage implemented a JVocabulary and have successfully cataloged
Japanese text.  I wonder if htdig or PHP can do that <grin>.

> But I believe the Splitter could stay the same for western languages, from
> what I've seen of the code. Can't really see the ing-cutting stuff here.

Oh, well, maybe it used to remove common suffixes and I took it out.
That's called stemming, and it's a pretty common pattern.  But you'd be
surprised how many people run into English-only quirks with Zope's
splitter, even in other Western languages.
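
A toy version of the idea (not Zope's code, and much cruder than a real
stemmer like Porter's algorithm):

# chop a few common English suffixes off a word
def stem(word):
    for suffix in ('ing', 'ed', 'ly', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

stem('indexing')   # 'index'
stem('words')      # 'word'
stem('crazy')      # 'crazy' -- left alone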

-Michel