[Zope-dev] Re: future searching of chinese text... (was: that JPython thing)

chas panda@skinnyhippo.com
Sat, 01 Apr 2000 17:16:10 +0800


Hi Michel,

> Moved to zope-dev.

Oops, I'm not on zope-dev, so thanks for cc'ing.

>>    very useful. I rewrote it in Python once but my Perl sucks so you may
>>    wish to do this yourself. Otherwise, e-mail me for copy ... it's so
>>    short I'm sure you'd rather not trust mine though.
>
>I don't know Perl at all so I'm probably not going to investigate this,
>unless the algorithm is pretty simple. 

It is. It's really just a lot of if/elses. (Btw, I also suspect
there's a flaw in the logic in one of the loops, but it seems to produce
accurate results. I say 'seems' because I don't read much Chinese and had
to rely on others telling me whether the results were correct - scary, I
know, but several MNCs didn't complain.)
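
For flavour, here's a sketch of the standard greedy "maximum matching"
approach - not my actual script, and the dictionary structure and names
here are made up:

# Greedy maximum matching: at each position take the longest
# dictionary word that matches, falling back to a single character.
def segment(text, word_set, max_word_len=4):
    words = []
    i = 0
    while i < len(text):
        # try the longest candidate first, shrinking until one matches
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in word_set:
                # single characters are always accepted, so the
                # loop is guaranteed to advance
                words.append(candidate)
                i += length
                break
    return words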

> Also the morphological
>dictionary looks like it can get big, I'd like to use a more advanced
>data structure than just a flat file dictionary; the splitter is used
>very often, and constantly refering to a flat file would be horrible.  I
>was probably going to use their dictionary to start with.

You're right about the dictionary. My Python scripts took 3-4 seconds
to read it in and build the Python lists/dictionaries. Once the data
was in memory, splitting was fast. That was acceptable for the
indexer/spider but not for the search field.
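
The load step was roughly this shape (the file name and the
one-word-per-line format are just assumptions for the sketch):

# Reading a flat word list into an in-memory set, one word per line.
def load_dictionary(path='chinese_words.txt'):
    words = set()
    for line in open(path, encoding='utf-8'):
        word = line.strip()
        if word:
            words.add(word)
    return words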

I tried using a dbf file, then a pickle, but neither helped much...
although I suspect that's because I screwed up somewhere along the line.

This is also why one of my first posts to the Zope list (months ago)
asked whether it was possible to read this into memory once and only
once... i.e. whether External Methods are persistent.
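
As far as I can tell, the module behind an External Method stays loaded
in the Zope process, so a module-level global should survive across
requests - something like this sketch (names made up, reusing the loader
and segmenter above):

_WORDS = None  # module-level cache, lives as long as the process

def _get_words():
    global _WORDS
    if _WORDS is None:
        # pay the 3-4 second load cost only on the first call
        _WORDS = load_dictionary()   # loader sketched earlier
    return _WORDS

def split_chinese(self, text):
    # hypothetical External Method entry point
    return segment(text, _get_words())   # segmenter sketched earlier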

>> b) What algorithm do you use in your searching of text ? Is it just a
>>    simple frequency tally ?
>
>I'm not sure what a simple frequency tally is. 

I made the term up :) I just meant exactly what you wrote below -
the score for a document being the frequency of the word in the
document.
If I were happy with that mechanism, I'd just use the ZCatalog for
searching. Unfortunately, I need something a little better and, inspired
by an article I read about Google, have been researching data mining and
other mechanisms for improving accuracy... but I'm starting to realize
it's a bit beyond me. :(
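
To make the "frequency tally" concrete, it's nothing more than this toy
sketch (the document store and names are made up):

# Score = how often the query word appears in each document.
def tally(documents, query_word):
    # documents: mapping of doc id -> list of words (already split)
    results = []
    for doc_id, words in documents.items():
        count = words.count(query_word)
        if count:
            results.append((count, doc_id))
    results.sort(reverse=True)   # highest tally first
    return results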

>Text indexes do not make any assumption about what a 'word' is, or how
>it came to be a word.  All language specific information, such as how to
>split a document in a certain language into words, is handled by the
>Lexicon.
>
>The splitting of a 'document' into words is done by the Splitter object
>provided by the Lexicon.  The Splitter object is what would need to
>implement the morphological analysis algorithm for whatever language you
>are going for.  In the case of english, this is very simple and involves
>splitting a document up at whitespaces.  The algorithm for Chinese or
>Japanese is much harder, as you're aware.

Yes, and I don't think either of us really wants to rebuild that
algorithm. There are several projects on the web (mandarintools.com is
just one of them) that have dealt with this; this might be of interest:
http://casper.beckman.uiuc.edu/~c-tsai4/chinese/wordseg/mmseg.html#Abstract
(it's actually moved but, being in China, I can't access geocities/xoom etc.)
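
If we did wire one in, I'd guess the Splitter side is just a thin
wrapper around a segmenter. I'm guessing at the interface the Lexicon
expects here, so treat this purely as a sketch:

class ChineseSplitter:
    # Assumes the Lexicon just wants something that turns a document
    # into a list of words; the method name is an assumption.
    def __init__(self, word_set):
        self.word_set = word_set

    def split(self, text):
        words = []
        # whitespace still delimits any embedded non-Chinese tokens
        for chunk in text.split():
            words.extend(segment(chunk, self.word_set))  # from earlier sketch
        return words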

chas