[Zope-Dev] Some thoughts on splitter (Sin Hang Kin)

Michel Pelletier michel@digicool.com
Mon, 17 Apr 2000 07:07:01 -0700


Sin Hang Kin wrote:
> 
> 
> There are two things: 1. Inserting the non-joiner to mark the break points
> between words.
> 2. The normalization process.
> 
> Step 1 really does change the document, but it is still not something
> ZCatalog does. It is up to the content manager to decide whether to do it.
> If he decides to, he should prepare the content as required, or write a
> pre-processor to do it. The splitter only recognizes the non-joiner as a
> word break point, just as it recognizes space and tab as word break
> points. Zope does not make any decision that nobody wants.

Oh I see, in this case you would want a UnicodeSplitter.  Keep in mind
that the Splitter is an attribute of a Lexicon object, and any number of
Lexicons (Vocabularies in the Zope management interface) can be created.
To split documents formatted in the way you describe, you would index
them with a ZCatalog whose Lexicon uses a UnicodeSplitter to split on
the non-joiner.  I understand what you are getting at now.
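
A minimal sketch of what such a splitter could look like.  The class
name comes from the discussion above, but the `split` method and its
signature are hypothetical and are not Zope's actual Splitter
interface.  It also assumes that "the non-joiner" means U+200C ZERO
WIDTH NON-JOINER:

```python
import re

# Assumption: "the non-joiner" is U+200C ZERO WIDTH NON-JOINER.
ZWNJ = "\u200c"


class UnicodeSplitter:
    """Hypothetical splitter, not Zope's real Splitter API.

    Treats the non-joiner as a word break, exactly like space or tab.
    """

    # One or more whitespace characters or non-joiners form a break.
    _breaks = re.compile("[\\s" + ZWNJ + "]+")

    def split(self, text):
        # Drop the empty strings produced by leading/trailing breaks.
        return [word for word in self._breaks.split(text) if word]
```

The document itself is never modified here; the non-joiners are only
consumed as break points while the word list is produced.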
 
> Step 2 is performed when building the index, just as you would capitalize
> the index terms. Nothing changes the original content; when ZCatalog
> builds the index, it converts the various encodings to Unicode, performs
> normalization, and optionally makes more changes like stemming, sym
> combination etc. But none of this changes the content.

Actually, the index code wouldn't need to change at all.  Indexes map
'words' to the documents that contain those words, but the indexes
themselves don't know anything about the words; they map 'word ids'
(integers) to the documents.  The object that maps these word ids back
to words is, once again, the Lexicon.  So, your UnicodeLexicon could do
the normalization you speak of, *and* provide the UnicodeSplitter that
does #1.  The class hierarchy would look like this:

            UnicodeVocabulary
                   ^
                   |
             UnicodeLexicon (provides)--> UnicodeSplitter

Other than coming up with the Splitter and the normalization code, no
changes at all would need to be made to the 2.2 ZCatalog to do what you
want.  This could be shipped as a clean third-party product.
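
To illustrate the division of labour, here is a toy sketch in which
the index only ever sees integer word ids, while the lexicon owns
splitting and normalization.  Everything in it is hypothetical: the
method names, the `Index` class, and the choice of NFKC plus case
folding as "the normalization" are assumptions for illustration, not
the ZCatalog 2.2 API.  The non-joiner is again assumed to be U+200C.

```python
import unicodedata

ZWNJ = "\u200c"  # assumption: the non-joiner is U+200C


def split_on_non_joiner(text):
    """Stand-in for a UnicodeSplitter: break on whitespace and ZWNJ."""
    return text.replace(ZWNJ, " ").split()


class UnicodeLexicon:
    """Hypothetical lexicon mapping normalized words to word ids."""

    def __init__(self, splitter=split_on_non_joiner):
        self.splitter = splitter
        self._wids = {}   # normalized word -> word id
        self._words = []  # word id -> normalized word

    def normalize(self, word):
        # One possible normalization; the real choice is up to the author.
        return unicodedata.normalize("NFKC", word).casefold()

    def word_ids(self, text):
        """Split and normalize text, assigning new ids as needed."""
        ids = []
        for word in self.splitter(text):
            word = self.normalize(word)
            if word not in self._wids:
                self._wids[word] = len(self._words)
                self._words.append(word)
            ids.append(self._wids[word])
        return ids

    def lookup(self, word):
        """Return the id of a known word, or None."""
        return self._wids.get(self.normalize(word))


class Index:
    """Toy index: maps word ids to document ids, never sees words."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._docs = {}  # word id -> set of document ids

    def index_doc(self, doc_id, text):
        for wid in self.lexicon.word_ids(text):
            self._docs.setdefault(wid, set()).add(doc_id)

    def search(self, word):
        return set(self._docs.get(self.lexicon.lookup(word), ()))
```

Swapping in a different lexicon changes how text is split and
normalized without touching `Index` at all, which is the point made
above about the index code not needing to change.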

-Michel