[Zope-dev] Stop words/vocabulary

Dieter Maurer dieter@handshake.de
Sat, 10 Feb 2001 13:40:17 +0100 (CET)


Hi Arno,

Arno Gross writes:
 > I have now a german stop word list and would like to
 > apply it for my current ZCatalog 'NewsCatalog'. But how? 
 > Or should I copy my list to the source (no good idea)?
I told you that you can have stop words.

What I did not tell you is that you should not have them:

  In my view stop words are a bad thing, invented
  when computers were slow and storage expensive.

  The only thing they do now is make life more difficult:

    You search for a word that happens to be a stop
    word and you get no hits, usually without a useful
    problem indication.

    Phrase searches become a nightmare with stopwords
    (at least if one tries to stick to the correct
    semantics).
    
    If you change the stopword list, your index becomes inconsistent
    and needs reindexing.

    How should stopwords be handled with advanced
    search facilities such as phonetic searches,
    search patterns, or misspelling-tolerant searches?
    Every time you want a clear semantic
    specification for your searches, stop words
    get in your way.
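
    The first and third problems can be reproduced with a toy
    inverted index. This is only an illustrative sketch in modern
    Python: "build_index" and "search" are hypothetical helpers, not
    ZCatalog code, and the German sample data is made up.

```python
def build_index(docs, stop_words):
    """Map each non-stop word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in stop_words:
                continue  # stop words never reach the index
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, word, stop_words):
    """Look up a single word; stop words are dropped from the query too."""
    word = word.lower()
    if word in stop_words:
        return set()  # no hits, and no hint why
    return index.get(word, set())


docs = {1: "die Zeit ist knapp", 2: "Zeit und Geld"}
old_stops = {"die", "ist", "und"}
index = build_index(docs, old_stops)

hits_for_stop_word = search(index, "die", old_stops)
hits_for_normal_word = search(index, "Zeit", old_stops)

# Change the list without reindexing: "und" is no longer a stop word,
# but it was never indexed, so document 2 is still not found.
new_stops = {"die", "ist"}
hits_after_change = search(index, "und", new_stops)
hits_after_reindex = search(build_index(docs, new_stops), "und", new_stops)
```

    Searching for "die" silently returns nothing, and "und" only
    becomes findable after the index is rebuilt with the new list.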


  Thus, reconsider whether you really want to have stop words.


But I told you, you can have them.
And I will help you get them if you think this is necessary.

The "Vocabulary" (Products.ZCatalog.Vocabulary.Vocabulary)
has a method "manage_stop_syn".
Currently it is defined empty and not exposed as a view.
You could implement it and insert it in
"Products.ZCatalog.Vocabulary.Vocabulary.manage_options".
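
Exposing the method as a management tab might look roughly like the
sketch below. "StandInVocabulary" is a hypothetical stand-in so that
the example runs without Zope. The tuple-of-dicts shape with 'label'
and 'action' keys, the existing tab entries, the method signature and
the 'Stop words' label are all assumptions and should be checked
against the actual source.

```python
class StandInVocabulary:
    """Hypothetical stand-in for Products.ZCatalog.Vocabulary.Vocabulary."""

    # Assumed shape: a tuple of dicts, one per management tab.
    manage_options = (
        {'label': 'Vocabulary', 'action': 'manage_main'},
        {'label': 'Query', 'action': 'manage_query'},
    )

    def manage_stop_syn(self, stop_syn, REQUEST=None):
        """Defined empty for now; the signature here is an assumption."""
        pass


# Append a tab whose action points at manage_stop_syn.
StandInVocabulary.manage_options = StandInVocabulary.manage_options + (
    {'label': 'Stop words', 'action': 'manage_stop_syn'},
)
```

In a real product the tab would more likely point at a form that
posts to "manage_stop_syn" rather than at the handler itself.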

The "SearchIndex.Lexicon.Lexicon" has a method
"set_stop_syn" to set the stopword dict.
What you can do:

  Put your stopwords or synonyms into a file in Python dictionary
  syntax.
  Make the file selectable in "manage_stop_syn",
  read it and "eval" it (this yields a Python dict), then call
  "set_stop_syn" with the result.
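
The steps above can be sketched as follows. This is a minimal sketch
in modern Python, not the Zope implementation: "StandInLexicon" is a
hypothetical stand-in that mirrors only the method name
"set_stop_syn", and "load_stop_syn" is a hypothetical helper. The
sketch swaps plain "eval" for "ast.literal_eval", which accepts only
literals and so is safer for a file chosen through a web form. The
file format (stop words mapped to None, synonyms mapped to a
replacement word) is an assumption.

```python
import ast
import os
import tempfile


class StandInLexicon:
    """Hypothetical stand-in for SearchIndex.Lexicon.Lexicon."""

    def __init__(self):
        self.stop_syn = {}

    def set_stop_syn(self, stop_syn):
        # Only stores the mapping; the real method may do more.
        self.stop_syn = stop_syn


def load_stop_syn(path):
    """Read a file containing a Python dict literal and return the dict."""
    with open(path, encoding="utf-8") as f:
        data = ast.literal_eval(f.read())
    if not isinstance(data, dict):
        raise ValueError("stop word file must contain a dict literal")
    return data


def demo():
    # Assumed format: stop words map to None, synonyms to a replacement.
    text = "{'der': None, 'die': None, 'das': None, 'automobil': 'auto'}"
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as f:
        f.write(text)
        path = f.name
    try:
        lexicon = StandInLexicon()
        lexicon.set_stop_syn(load_stop_syn(path))
    finally:
        os.remove(path)  # clean up the temporary file
    return lexicon.stop_syn
```

Inside "manage_stop_syn" the same two calls would apply: load the
selected file, then pass the resulting dict to the lexicon.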

As Chris pointed out, the GlobbingLexicon does not yet
support stopwords. The reason is probably that
the author did not know what stopwords and synonyms
should mean for a search process with wildcard characters.
If you know, then "GlobbingLexicon" is
easily extended along the lines of "Lexicon".


Dieter