[Zope-dev] Re: future searching of chinese text... (was: that JPython thing)

Michel Pelletier michel@digicool.com
Thu, 30 Mar 2000 16:02:07 -0800


Moved to zope-dev.

chas wrote:
> 
> >> 3. support of unicode
> >
> >Ah, as soon as python does, we do.  Also, soon there will be a Japanese
> >Vocabulary to support Japanese searching of text, and after that we are
> >going to try Chinese.  In Zope 2.2, using these examples, you can create
> >a Vocabulary object for the language of your preference.
> 
> Just on this subject:
> 
> a) You may find the Perl mandarin text-splitter at www.mandarintools.com

Yes, I found those recently.

>    very useful. I rewrote it in Python once but my Perl sucks so you may
>    wish to do this yourself. Otherwise, e-mail me for a copy ... it's so
>    short I'm sure you'd rather not trust mine though.

I don't know Perl at all, so I'm probably not going to investigate
this unless the algorithm is pretty simple.  Also, the morphological
dictionary looks like it can get big, so I'd like to use a more
advanced data structure than just a flat-file dictionary; the splitter
is used very often, and constantly referring to a flat file would be
horrible.  I was probably going to use their dictionary to start with.
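
For fast lookups, the simplest thing would be to load the word list
into an in-memory mapping once, up front.  A minimal sketch, assuming
a one-word-per-line dictionary file (the actual mandarintools format
may differ):

    def load_dictionary(path):
        # Read the word list into memory once; after that, per-word
        # lookups never touch the disk.
        words = {}
        for line in open(path).readlines():
            word = line.strip()
            if word:
                words[word] = 1   # dict used as a set: fast membership tests
        return words

Something fancier (a trie, or a persistent BTree) could come later if
memory becomes a problem.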
 
> b) What algorithm do you use in your searching of text? Is it just a
>    simple frequency tally?

I'm not sure what a simple frequency tally is.  A text index is a
mapping from a word id to a sequence of (document id, score) tuples
(see the sketch after this list):

  o The word id (which is an integer) is mapped to a word by the
Lexicon,

  o the document id (which is also an integer) is mapped to a
'document' (a Zope object, really) by the Catalog,

  o the score is the number of times the word that maps to the word id
was found in the document that maps to the document id.
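
In code, the structure looks something like this (illustrative names
only, not Zope's actual internals):

    # Lexicon: word -> word id
    lexicon = {'zope': 1, 'python': 2}

    # Text index: word id -> sequence of (document id, score) tuples
    index = {
        1: [(100, 3), (101, 1)],  # 'zope': 3 hits in doc 100, 1 in doc 101
        2: [(100, 2)],            # 'python': 2 hits in doc 100
    }

    def index_document(document_id, words):
        # The score is just how many times each word occurs in
        # the document.
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        for word, score in counts.items():
            word_id = lexicon.setdefault(word, len(lexicon) + 1)
            index.setdefault(word_id, []).append((document_id, score))

    def search(word):
        # Map the word to its id via the Lexicon, then fetch the
        # (document id, score) postings from the index.
        word_id = lexicon.get(word)
        if word_id is None:
            return []
        return index.get(word_id, [])

A query then goes word -> word id -> postings, and the Catalog turns
the document ids back into Zope objects.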

Text indexes do not make any assumption about what a 'word' is, or how
it came to be a word.  All language-specific information, such as how
to split a document in a given language into words, is handled by the
Lexicon.

The splitting of a 'document' into words is done by the Splitter object
provided by the Lexicon.  The Splitter object is what would need to
implement the morphological analysis algorithm for whatever language you
are going for.  In the case of English, this is very simple and
involves splitting the document at whitespace.  The algorithm for
Chinese or Japanese is much harder, as you're aware.
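
As a rough sketch of the division of labor (assuming a single split()
method; the real Splitter interface may differ):

    class EnglishSplitter:
        # For English, splitting at whitespace is essentially the
        # whole job.
        def split(self, document):
            return document.lower().split()

    class ChineseSplitter:
        # A Chinese Splitter would present the same interface, but
        # split() would have to segment the text morphologically,
        # e.g. against a word dictionary, since Chinese text has no
        # spaces between words.
        def split(self, document):
            raise NotImplementedError('needs a real segmentation algorithm')

The index only cares that split() hands back a sequence of words;
everything language-specific stays behind that one method.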

-Michel