[Zope] Indexing: ZopeSplitter and numbers

Richard Jones richard@bizarsoftware.com.au
Wed, 14 Nov 2001 08:46:07 +1100


On Wednesday 14 November 2001 08:26, sean.upton@uniontrib.com wrote:
> I think I'd have to jump on the bandwagon and agree that numbers should not
> be stripped.  I'll second the idea of a fish-bowl proposal.
>
> In a full text search of classified ads, for example, one wants to search
> for a 2000 Ford F150; in Zope 2.3.x, Splitter.c stripped out both 2000 and
> F150.  The change was easy: just replace isalpha() with isalnum() in the
> relevant part of the code.  I'm not sure what the story is in 2.4, but it
> sounds like people searching for a year 2000 truck are going to find ads
> for ones built in 1982.

This is the behaviour we want - have you experienced any negative 
side-effects from doing this? 
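
For anyone following along, here is a minimal Python sketch of the difference 
the quoted isalpha() -> isalnum() change makes. It is not the Splitter.c code; 
the function name split_words, the keep_digits switch and the min_length 
cut-off are all made up for illustration, and the assumption that short words 
are dropped by default is inferred only from the remark about one-character 
words further down.

```python
def split_words(text, keep_digits=False, min_length=2):
    """Split text into lowercase words (illustrative only, not Splitter.c).

    keep_digits=False mimics an isalpha()-style test: digits end a word.
    keep_digits=True mimics an isalnum()-style test: digits are word chars.
    min_length is a hypothetical cut-off for very short words.
    """
    is_word_char = str.isalnum if keep_digits else str.isalpha
    words = []
    current = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch.lower())
        else:
            # A non-word character closes the current word, if any.
            if len(current) >= min_length:
                words.append("".join(current))
            current = []
    # Flush the last word when the text does not end on a separator.
    if len(current) >= min_length:
        words.append("".join(current))
    return words


print(split_words("2000 Ford F150"))
print(split_words("2000 Ford F150", keep_digits=True))
print(split_words("c programmer", keep_digits=True, min_length=1))
```

With the alpha-only test, "2000" vanishes and "F150" is cut down to a lone 
"f" that the length cut-off then drops, so only "ford" is left to index. 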


> I use a modified Splitter.so that allows numbers, as well as one-character
> words, so people can search for "c programmer" in the classified ads.
>
> I'm curious about a few other things (that I really haven't tested):
> - How does Zope's splitter handle hyphenated words?
> - Is there a way to split words with period characters reliably, supposing
> I wanted to be able to search for terms like "yahoo.com" or "Splitter.so"
> or "Microsoft .NET" in text?

... or e-mail addresses. We currently substitute "_" for the "@" and "." 
characters in e-mail addresses so they are indexed usefully. In your case, 
I'm not sure that'd be appropriate. If you only have "keywords" in your 
TextIndex, I suppose the only stop chars you'd want are whitespace, and 
everything else is in.
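
A rough Python sketch of the e-mail substitution, run over the text before it 
is handed to the indexer. The regular expression is a deliberately loose 
assumption, not a full address grammar, and the names EMAIL_RE and 
protect_emails are invented for this example. It only helps if the splitter 
keeps "_" inside a word.

```python
import re

# Loose, illustrative pattern: local part, "@", then a dotted domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def protect_emails(text):
    """Replace "@" and "." inside e-mail addresses with "_"."""
    def _sub(match):
        return match.group(0).replace("@", "_").replace(".", "_")
    return EMAIL_RE.sub(_sub, text)


print(protect_emails("mail richard@bizarsoftware.com.au today"))
```

Periods outside an address, such as the full stop ending a sentence, are left 
alone because the pattern needs at least one domain character after each ".". 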


> I would think that the appropriate default behavior for ZopeSplitter would
> be relaxed about stripping out things.

My concern is that there's _specific_ code in there that does this stuff, and 
I want to know if there'll be any negative consequences of changing its 
behaviour...


   Richard