[Zope-Dev] Some thoughts on splitter (Sin Hang Kin)

Christian Wittern chris@ccbs.ntu.edu.tw
Sat, 22 Apr 2000 19:22:03 +0800


Michel Pelletier wrote:
> Christian Wittern wrote:
> >
> > As soon as ZCatalog starts using Unicode,
>
> Keep in mind that the ZCatalog will not use Unicode at all.  In fact,
> the ZCatalog pretty much works with integers the whole time for
> efficiency.  There is nothing language specific in ZCatalog.
>

[very interesting explanation deleted ... ]
>
> What words 'are' is determined by the Splitter, which is also provided
> by the Vocabulary object.  This is because, like the words themselves
> being very specific to a language, so are the semantics which define
> them.

I see this much more clearly now. I started looking at the source in CVS, but
it somehow differs from what I see in 2.1.6 on Windows. Have there been
changes in this area? What files should I look at?

>
>
> > To accommodate this, there have to be some changes to the way searches are
> > done as well: On most search engines, giving a few search terms separated by
> > whitespace means ANDing them for the search, which is fine.
>
> oh ok, I can see how this is not ideal because it could possibly false
> match other words that contained your search characters in a different
> order.

Right, but it is not that much of a problem; I think the interface could
take care of that.
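
To make the false-match case concrete, here is a minimal sketch of ANDing
single-character terms. It is not ZCatalog code: the function names
(build_char_index, and_search) and the sample documents are made up for
illustration, and it assumes each character is indexed as its own term.

```python
from collections import defaultdict

def build_char_index(docs):
    """Map each non-whitespace character to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for ch in text:
            if not ch.isspace():
                index[ch].add(doc_id)
    return index

def and_search(index, terms):
    """Return the ids of documents that contain every term, in any order."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Hypothetical sample data: the same two characters in opposite order.
docs = {
    1: "\u65e5\u672c",  # U+65E5 U+672C
    2: "\u672c\u65e5",  # U+672C U+65E5
}
index = build_char_index(docs)
hits = and_search(index, ["\u65e5", "\u672c"])
```

Both documents come back, because a plain AND ignores the order of the
characters. An interface could filter or rank such hits afterwards.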
>
> > If this is not
> > desired however, most search engines allow the user to use quotes to
> > indicate the terms should be used as a phrase. Unfortunately, Zope does not
> > support this yet. I think it is highly desirable!!!
>
> We did too, which is why text indexes do support phrase matching with
> quotes.  This is holdover code from ZTables and I did not write it or
> change it at all, so maybe it is broken?  Have you tested it?  Just
> search for "a phrase".

I don't think "a phrase" would work, because 'a' is a stopword. On 2.1.6 I
created a textindex as described in the ZCatalog Howto, but this type of
search does not seem to work, even in cases where no stopwords are involved.
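
A minimal sketch of why I would expect "a phrase" to be a poor test case,
assuming the splitter drops stopwords before the query is evaluated. The
stopword list and the split_words function are invented for illustration; I
have not checked how the real Splitter handles quoted phrases.

```python
# Assumed stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of"}

def split_words(text, stopwords=STOPWORDS):
    """Lowercase the text, split it on whitespace and drop stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]

query_terms = split_words("a phrase")
```

Under this assumption only 'phrase' is left, so the query degenerates into a
single-word search and says nothing about whether phrase matching works.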
>
> > If ZCatalog would support this type of search, this could be used for Asian
> > languages and searches would return results where two or more characters are
> > searched for, by looking for documents where they occur in sequence.
> >
> > Does this make any sense?
>
> Yes, I can see how this rather handily gets around needing an expensive
> up front parsing into semantic chunks, the equivalent of Asian 'words'.
> This would actually not be difficult to implement at all.  What is the
> benefit then of pre-parsing documents into semantically defined 'words'
> instead of just indexed sequences of characters?  The only one I can
> think of is index space, since the vocabulary and the number of index
> references would go down quite a bit with some up front smart
> processing.
>

Yes, the index space would go down, but the wordlist in the splitter would
have to be pretty big and would be very domain specific. The only problem
with the approach described above is that you could find things that are not
words: you look for two characters and get a result where they are merely a
substring of a longer word. This could be regarded as a bug or a feature :-)
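
Here is a rough sketch of the kind of sequence search I have in mind, and of
the substring effect. It assumes an index that records character positions per
document; the names (build_positional_index, phrase_search) and the sample
documents are hypothetical, not anything in ZCatalog.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each character to {doc_id: [positions where it occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents where the characters of phrase occur consecutively."""
    if not phrase:
        return set()
    hits = set()
    for doc_id, starts in index.get(phrase[0], {}).items():
        for start in starts:
            # Every following character must sit at the next position.
            if all(start + i in index.get(ch, {}).get(doc_id, ())
                   for i, ch in enumerate(phrase)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical sample data.
docs = {
    1: "\u4eac\u90fd",        # U+4EAC U+90FD (Kyoto)
    2: "\u6771\u4eac\u90fd",  # U+6771 U+4EAC U+90FD (Tokyo Metropolis)
    3: "\u90fd\u4eac",        # the same two characters, reversed
}
index = build_positional_index(docs)
hits = phrase_search(index, "\u4eac\u90fd")
```

Document 3 is correctly excluded because the order is wrong, but document 2 is
returned even though the two characters there are only part of a longer word.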

All the best,

Christian