[Zope] Indexing: ZopeSplitter and numbers

sean.upton@uniontrib.com sean.upton@uniontrib.com
Tue, 13 Nov 2001 14:03:37 -0800


Keep in mind that my production setup is still using Zope 2.3.2.  YMMV.  

The only downside we found from hacking the splitter was that wildcard
searches on single-character keywords are bad, bad, bad (they consume a lot
of resources).  We are working on addressing this on an
application-by-application basis (in our classifieds application, we already
rewrite users' queries to auto-add wildcards to words longer than 3
characters, so checking for this and stripping the wildcards from words of 2
or fewer characters should be possible).
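
A minimal sketch of that kind of query rewriting, in plain Python.  The
function name, the thresholds as parameters, and the pass-through of
and/or/not are assumptions for illustration; this is not the actual
classifieds code and does not use any Zope API.

```python
def rewrite_query(query, min_wildcard_len=4, max_strip_len=2):
    """Hypothetical helper: add wildcards to long words, strip them from short ones."""
    # Assumption: boolean operators should pass through untouched.
    operators = {"and", "or", "not"}
    out = []
    for term in query.split():
        # The word without any leading or trailing wildcard characters.
        bare = term.strip("*?")
        if not bare:
            continue  # the term consisted of wildcards only; drop it
        if term.lower() in operators:
            out.append(term)
        elif len(bare) <= max_strip_len:
            # Short words: never let a wildcard through, it is too expensive.
            out.append(bare)
        elif len(bare) >= min_wildcard_len and "*" not in term and "?" not in term:
            # Words longer than 3 characters with no wildcard yet: add one.
            out.append(bare + "*")
        else:
            # Anything else (e.g. a 3-character word) is left as the user typed it.
            out.append(term)
    return " ".join(out)


print(rewrite_query("c* programmer"))
print(rewrite_query("2000 ford f150"))
```

The short-word check runs before the wildcard is added, so a query such as
"c* programmer" never reaches the index with a wildcard on the
single-character word.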

Numbers cause no problems, for us, at least.  Again, this was with Zope
2.3.2...

Sean

-----Original Message-----
From: Richard Jones [mailto:richard@bizarsoftware.com.au]
Sent: Tuesday, November 13, 2001 1:46 PM
To: sean.upton@uniontrib.com; andreas@zope.com; c.duncan@nlada.org;
zope@zope.org
Subject: Re: [Zope] Indexing: ZopeSplitter and numbers


On Wednesday 14 November 2001 08:26, sean.upton@uniontrib.com wrote:
> I think I'd have to jump on the bandwagon and agree that numbers should not
> be stripped.  I'll second the idea of a fish-bowl proposal.
>
> In a full text search of classified ads, for example, one wants to search
> for a 2000 Ford F150; in Zope 2.3.x, Splitter.c stripped out both 2000 and
> F150.  The change was easy: just replace isalpha() with isalnum() in the
> relevant part of the code.  I'm not sure what the story is in 2.4, but it
> sounds like people searching for a year 2000 truck are going to find ads
> for ones built in 1982.

This is the behaviour we want - have you experienced any negative 
side-effects from doing this? 
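
To illustrate the isalpha()/isalnum() difference described in the quoted
text, here is a simplified Python model of a character-class splitter.  It
is only a sketch: the function name and the min_len parameter are
assumptions, and the real Splitter.c has more logic (stop words, synonyms,
length limits) than is shown here.

```python
def split_words(text, keep_char=str.isalnum, min_len=1):
    """Hypothetical model: split text into lowercase words made of 'kept' characters."""
    words = []
    current = []
    for ch in text:
        if keep_char(ch):
            current.append(ch.lower())
        elif current:
            # A non-word character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    # Drop words shorter than min_len (min_len=2 discards one-character words).
    return [w for w in words if len(w) >= min_len]


ad = "2000 Ford F150"
print(split_words(ad, keep_char=str.isalpha, min_len=2))  # letters only
print(split_words(ad, keep_char=str.isalnum, min_len=1))  # letters and digits
```

With the letters-only predicate and one-character words dropped, only
"ford" survives from the ad text; with the alphanumeric predicate all three
terms are kept.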


> I use a modified Splitter.so that allows numbers, as well as one-character
> words, so people can search for "c programmer" in the classified ads.
>
> I'm curious about a few other things (that I really haven't tested):
> - How does Zope's splitter handle hyphenated words?
> - Is there a way to split words with period characters reliably, supposing
> I wanted to be able to search for terms like "yahoo.com" or "Splitter.so"
> or "Microsoft .NET" in text?

... or e-mail addresses. We currently sub the "@" and "." chars in e-mail
addresses with "_" so they are indexed usefully. In your more general case,
I'm not sure that'd be appropriate. If you only have "keywords" in your
TextIndex, I suppose the only stop chars you'd want are whitespace, and
everything else is in.
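
A rough sketch of the "@" and "." substitution for e-mail addresses, in
plain Python.  The regular expression is a deliberately loose assumption
about what counts as an address, the function name is hypothetical, and the
approach only helps if the splitter treats "_" as a word character.

```python
import re

# Assumption: a loose pattern that is good enough to spot typical addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def protect_emails(text):
    """Hypothetical helper: replace '@' and '.' inside e-mail addresses with '_'."""
    def mangle(match):
        return match.group(0).replace("@", "_").replace(".", "_")
    return EMAIL_RE.sub(mangle, text)


print(protect_emails("Mail zope@zope.org for details."))
```

The same substitution would have to be applied to search queries before
they reach the index, otherwise a search for the original address would not
match the mangled term.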


> I would think that the appropriate default behavior for ZopeSplitter would
> be relaxed about stripping out things.

My concern is that there's _specific_ code in there that does this stuff, and
I want to know if there'll be any negative consequences of changing its
behaviour...


   Richard