[Zope-dev] SearchIndex Splitter lowercase indexes?

Fri, 25 May 2001 09:17:28 +0200

Hi Michel,

Michel Pelletier wrote:
>The splitter should really be a modular component.  That's what
>vocabularies were originally for, to store language specific artifacts
>like word lists and splitters.  For example, stripping the "ing" suffix
>obviously only makes sense in English.  So if you want to change this
>behavior, make your own vocabulary with its own custom splitter.
>
>This is because each language has very different splitting requirements,
>and even different meanings of the word "word".  Imagine, for example,
>splitting Japanese or one of the Chinese languages (based textually on
>Kanji).
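
To make the "vocabulary with its own custom splitter" idea concrete, here is a
minimal sketch. All names in it (Vocabulary, english_splitter, plain_splitter)
are hypothetical and only illustrate the plug-in idea; this is not the actual
SearchIndex or ZCatalog API, and the "ing" stripping is deliberately naive.

```python
import re

# Hypothetical names throughout; this is NOT the Zope SearchIndex API.

def english_splitter(text):
    """Lowercase, split on non-letters, strip a trailing 'ing' from longer words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-3] if w.endswith("ing") and len(w) > 5 else w for w in words]


def plain_splitter(text):
    """Lowercase and split on whitespace only, with no suffix stripping."""
    return text.lower().split()


class Vocabulary:
    """Holds the language-specific splitter used at index and query time."""

    def __init__(self, splitter):
        self.splitter = splitter

    def words(self, text):
        return self.splitter(text)


english = Vocabulary(english_splitter)
neutral = Vocabulary(plain_splitter)
print(english.words("Indexing the Splitter"))
print(neutral.words("Indexing the Splitter"))
```

Swapping the splitter changes what ends up in the index without touching the
index code itself, which is the point of keeping it modular.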

Just imagine German! There are composite words without spaces or other
non-alphanumeric characters between them.
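
A rough sketch of what splitting such a composite word involves, assuming a
word list is available. The function name split_compound, the min_len
parameter and the three-word lexicon are made up for illustration; a real
splitter would also have to cope with linking letters and inflected stems,
which this one ignores.

```python
def split_compound(word, lexicon, min_len=3):
    """Return the parts of `word` if lexicon entries cover it completely, else []."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible head first, then recurse on the remainder.
    for i in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon, min_len)
            if rest:
                return [head] + rest
    return []


# Toy lexicon, an assumption for this example only.
lexicon = {"dampf", "schiff", "fahrt"}
print(split_compound("Dampfschifffahrt", lexicon))
```

Without the lexicon there is nothing in the string itself that marks where one
part ends and the next begins.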

>Identifying words in Kanji is a very hard problem.  In romance languages,
>it's easy, words are separated by spaces, but in Kanji words are
>differentiated by the context of the surrounding characters, there are no
>"spaces".  Splitting Kanji text requires a pre-existing dictionary and some
>interesting heuristic matching algorithms.  And that's only half of
>Japanese itself, really, since there are two other alphabets (hiragana and
>katakana) that *are* character-phonetic like romance languages, and all
>three alphabets are commonly mixed together in the same sentence!  Chinese
>language may also have these phonetic alphabets.
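
One of the simplest dictionary-based heuristics of the kind mentioned above is
greedy longest match: at each position, take the longest dictionary entry that
fits. The sketch below assumes a tiny hand-made lexicon and a hypothetical
function name, max_match; it is only meant to show the shape of the problem,
not a usable Japanese splitter.

```python
def max_match(text, lexicon):
    """Greedy forward segmentation: take the longest lexicon entry at each position."""
    longest = max((len(w) for w in lexicon), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if piece in lexicon or size == 1:
                tokens.append(piece)
                pos += size
                break
    return tokens


# Toy lexicon mixing kanji and hiragana entries, an assumption for this example.
lexicon = {"東京", "東京都", "京都", "に", "住む"}
print(max_match("東京都に住む", lexicon))
```

Note that the lexicon contains both "東京" and "京都", which overlap in the
input; the greedy rule just picks the longest entry, and that choice is not
always the right reading, which is where the harder heuristics come in.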

The same applies for German: You'd need a huge dictionary with word stems,
exceptions, and stop words.
Stems of many words change in different cases, too.

>In other words, it's not an easy problem!  There is going to be an
>unimaginable culture clash when asian and other non-romance languages
>catch up to the volume of romance language content on the web.

Well, English or German in fact aren't romance languages, they're germanic
:-)

Eric