[Zope-dev] Some thoughts on splitter

Michel Pelletier michel@digicool.com
Fri, 14 Apr 2000 07:44:48 -0700


Sin Hang Kin wrote:
> 
> Yes! The pre-process approach is heading the right way.
> 
> However, I would like to suggest inserting not spaces, but some code that
> does not alter the display. It seems that Unicode has set aside the
> code &zwnj; (U+200C), which is called the zero width non-joiner. The code
> can then be stored with the text, where it would be there for editing and
> go through later processing.

I am averse to the idea of ZCatalog inserting information into
documents for its own purposes. I don't think this is good design, and I
doubt it's very portable.
 
> The index-every-char approach is not preferred, because it can be emulated
> by the previous method if needed.
> 
> Also, the catalogue must use Unicode for cross-encoding search. It is well
> known that Han text has many encodings: big-5, gb2312, jis, etc. It is good
> practice to convert all text to Unicode, normalize it, then perform the
> splitting.

Sounds like a NormalizingSplitter of sorts.
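
A minimal sketch of that convert, normalize, split order, in Python 3.
This is not ZCatalog's Splitter interface: the function name
normalizing_split is hypothetical, and the choice of NFKC normalization
and of splitting on non-word characters are assumptions made only for
illustration.

```python
import re
import unicodedata

# Hypothetical helper, not part of ZCatalog: decode, normalize, then split.
def normalizing_split(data, encoding):
    # 1. Convert the source bytes to Unicode using the given encoding.
    text = data.decode(encoding)
    # 2. Normalize (NFKC is an assumption) so that equivalent forms,
    #    such as fullwidth Latin letters, compare equal; fold case too.
    text = unicodedata.normalize("NFKC", text).lower()
    # 3. Split on anything that is not a word character.
    return re.findall(r"\w+", text)

# The same words arriving in two encodings yield the same index terms.
big5_words = normalizing_split("Zope 中文".encode("big5"), "big5")
utf8_words = normalizing_split("Ｚｏｐｅ 中文".encode("utf-8"), "utf-8")
print(big5_words == utf8_words)
```

Because splitting happens only after decoding and normalizing, the
catalogue never has to compare raw bytes from different encodings.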
 
> The conversion must be based on language and encoding; however, most HTML
> does not declare its language and encoding. I have seen some encoding
> detection code, based on checking for frequently used Han characters,
> which gives a good guess of the encoding.
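
The frequency idea might look roughly like the Python 3 sketch below. It
is not the detection code mentioned above, which I have not seen: the
name guess_encoding, the candidate list, and the tiny COMMON_HAN sample
are all hypothetical, and a real detector would need a much larger
frequency table.

```python
# Hypothetical sample of frequently used Han characters (assumption).
COMMON_HAN = set("的一是不了人我在有他这這中大来來上国國")

def guess_encoding(data, candidates=("utf-8", "big5", "gb2312", "shift_jis")):
    """Return the candidate whose decoding contains the most common Han
    characters, or None if no candidate can decode the bytes at all."""
    best, best_score = None, -1
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = sum(1 for ch in text if ch in COMMON_HAN)
        if score > best_score:  # ties keep the earlier candidate
            best, best_score = encoding, score
    return best

print(guess_encoding("我是中国人".encode("gb2312")))
```

A wrong candidate usually either fails to decode or decodes to rare
characters, so it scores low. The result is still only a guess, and
short or mixed-language text can fool it.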

Can you post these comments on the interfaces Wiki so they do not get
lost?

http://www.zope.org/Members/michel/Projects/Interfaces/Splitter

-Michel