[Zope-dev] Some thoughts on splitter

Sin Hang Kin kentsin@poboxes.com
Fri, 14 Apr 2000 11:41:29 +0800


Yes! The pre-processing approach is heading in the right direction.

However, I would suggest inserting not spaces but a code that does not
alter the display. Unicode has set aside the code U+200C (&#8204;), called
the zero width non-joiner. The code can then be stored with the text, where
it stays in place through editing and is available to later processing.
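
As a rough illustration of the idea (a minimal sketch, not existing splitter
code; the function names are made up for this example), the pre-processor
could join already-segmented words with U+200C and the splitter could break
on that code as well as on ordinary whitespace:

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, invisible when displayed


def mark_words(words):
    """Join pre-segmented words with the invisible boundary code."""
    return ZWNJ.join(words)


def split_marked(text):
    """Split on whitespace first, then on the boundary code."""
    tokens = []
    for chunk in text.split():
        tokens.extend(t for t in chunk.split(ZWNJ) if t)
    return tokens


def strip_markers(text):
    """Recover the original text, e.g. for export or comparison."""
    return text.replace(ZWNJ, "")


stored = mark_words(["中文", "分词"])
print(split_marked(stored))
print(strip_markers(stored))
```

The stored text displays the same as the unmarked text, and the markers can
be stripped again whenever the plain form is needed.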

Indexing every character is not preferred, since the marker method above
can emulate it if needed.

Also, the catalogue must use Unicode for cross-encoding search. It is well
known that Han text comes in many encodings: Big5, GB2312, JIS, etc. It is
good practice to convert all text to Unicode, normalize it, and then perform
the splitting.
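
A small sketch of the convert-then-normalize step, using only the Python
standard library. The choice of the NFKC normalization form is my assumption
here (the point is only that one form is applied consistently), and the
function name is made up for this example:

```python
import unicodedata


def to_normalized_unicode(raw, encoding):
    """Decode bytes in a known encoding, then normalize the result.

    NFKC is an assumed choice; any single normalization form would do
    as long as indexed text and queries both go through it.
    """
    text = raw.decode(encoding)
    return unicodedata.normalize("NFKC", text)


# The same two characters stored in two different legacy encodings.
from_big5 = to_normalized_unicode("中文".encode("big5"), "big5")
from_gb = to_normalized_unicode("中文".encode("gb2312"), "gb2312")
print(from_big5 == from_gb)
```

Once both documents are in the same normalized Unicode form, a query matches
either of them regardless of how they were originally stored.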

The conversion must be based on language and encoding; however, most HTML
does not declare either. I have seen encoding-detection code that checks
for frequently used Han characters and gives a good guess of the encoding.
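
I do not have that detection code at hand, but the idea can be sketched
roughly as follows. Everything here is an assumption for illustration: the
tiny list of common characters, the candidate encodings (with shift_jis
standing in for "jis"), and the function names. A real detector would need
much larger frequency tables per language.

```python
# Toy list of very common Han characters shared by traditional and
# simplified text; an illustrative assumption, not a real frequency table.
COMMON_HAN = set("的一是不了人我在有中大文上他")

# Candidate encodings to try; also an assumption for this sketch.
CANDIDATES = ("big5", "gb2312", "shift_jis")


def score(raw, encoding):
    """Count common Han characters after decoding; -1 if decoding fails."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return -1
    return sum(1 for ch in text if ch in COMMON_HAN)


def guess_encoding(raw, candidates=CANDIDATES):
    """Return the best-scoring candidate, or None if nothing scores."""
    scores = {enc: score(raw, enc) for enc in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_encoding("的是中文".encode("gb2312")))
print(guess_encoding("一了人".encode("big5")))
```

A wrong candidate either fails to decode or decodes into rare characters, so
the right encoding tends to get the highest count of common ones.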

Rgs,

Kent Sin