[Zope-dev] "stemmed and stopped": problems with stopwords and the 'and' operator

R. David Murray bitz@bitdance.com
Thu, 17 Aug 2000 19:28:15 -0400 (EDT)


On Thu, 17 Aug 2000, Martijn Pieters wrote:
> No clues as to where you'll find the stopword code, but the Persistence
> thingy is caused by the magic that ZODB performs: it initializes the
> correct Persistence module when it itself is imported. This way Jim
> managed to have ZODB3 and BoboPOS2 exist in the same Zope distribution.
> 
> Do an import ZODB before you do your Splitter import, and all will be
> dandy.

Thanks, worked like a charm.

I think I've found the stopword code.  To cement my understanding
I'm going to write this up.  Maybe somebody will find it useful <grin>.

UnTextIndex accesses the splitter through the Splitter method of the
Lexicon associated with the index.  That Lexicon instance is created
when the Vocabulary or Catalog is created.  (Comments in the code
indicate that in the future each TextIndex could have its own Lexicon,
which makes sense to me.)  A Lexicon instance can be passed a list
of stop words (and/or synonyms) when it is initialized.  Vocabulary
does this for Lexicon (but not GlobbingLexicon, which internal
comments indicate does not use stopwords).  The Lexicon instance
stores this list in a property, and passes it to the real Splitter
when its Splitter method is called.
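
Here is a minimal sketch of that flow as I understand it.  It is not
the actual Zope code: ToySplitter and ToyLexicon are hypothetical
stand-ins, the tokenizing is a simple regex, and the stop words are
held in a plain set instead of whatever structure the real Lexicon
uses.

```python
import re


class ToySplitter:
    """Hypothetical stand-in for the real Splitter.

    Lowercases the text, splits on anything that is not a letter,
    and drops any word found in stop_words.
    """

    def __init__(self, text, stop_words=None):
        stop_words = stop_words or set()
        tokens = re.findall(r"[a-z]+", text.lower())
        self.words = [w for w in tokens if w not in stop_words]

    def __iter__(self):
        return iter(self.words)


class ToyLexicon:
    """Hypothetical stand-in for Lexicon: it only stores the stop
    words and hands them to the splitter on request."""

    def __init__(self, stop_words=None):
        # Assumption: kept as a set here purely for simplicity.
        self.stop_words = set(stop_words or ())

    def Splitter(self, text):
        # The index never sees the stop list; it just asks the
        # lexicon for a splitter and gets one that already filters.
        return ToySplitter(text, self.stop_words)


if __name__ == "__main__":
    lexicon = ToyLexicon(["the", "and"])
    print(list(lexicon.Splitter("The car and the road")))
```

With a stop list the stopped words never reach the index at all; a
lexicon built with no stop list passes every word through.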

So the fix that I submitted to the collector earlier today for the 'and'
operator problem involving stopwords should work for 'listed' stopwords
as well as the punctuation and numbers that I was able to test it on.
(In my comments in the patch I said I wasn't sure.)  I still can't test
it because I'm using a GlobbingLexicon <wry grin>.

In perusing the code I'm also feeling more confident that the
change I made to __getitem__ in that fix is in fact semantically
correct.  Or at least consistent with the rest of the __getitem__ code.

GlobbingLexicon not using stopwords also explains the few hits
on 'the and car' that had confused me.  Those entries
really must have 'the' as an indexed term, unlike the rest.

Oh, by the way, the comments in TextIndex seem to agree with me
as to the conventional meaning of the word 'stemmed' <grin>.

--RDM