[Zope] Weighing catalog searches per index ?

Casey Duncan casey at zope.com
Thu Jan 8 16:53:55 EST 2004


On Thu, 8 Jan 2004 16:24:58 -0500 
Jean-Francois.Doyon at CCRS.NRCan.gc.ca wrote:

> Casey,
> 
> Thanks for pointing out this product, I'll have to give it a try, as I
> can foresee many useful applications for it !

Cool. It's new and I'm eager to get feedback from the field on it (no pun
intended).
 
[...]
> 
> Your product seems to have a good base to start with.  The problem
> now, and one that stopped me in my tracks, is how to
> define/calculate/configure this "weighing" concept.  You suggest
> there's some underlying functionality for weighing already, maybe it'd
> just be a matter of taking advantage of it, and documenting how to use
> it ? The big question would be what does a weight of "1" MEAN versus a
> weight of "2" or "5" ?

ZCTextIndex calculates document and word scores. When queries are
performed these scores are combined as intermediate results are combined
(using unions and intersections). The weighted versions of these
commands allow you to weight one set differently than another. The
weight multiplies the score by some factor as the set operation is
performed.
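
A rough sketch of the idea in plain Python, using dicts of docid -> score in
place of the real C-backed sets. The function names and the choice to sum the
weighted scores for documents found in both sets are assumptions made for
illustration, not the actual ZCTextIndex code:

```python
def weighted_union(set1, set2, weight1=1, weight2=1):
    """Union of two {docid: score} mappings, scaling each side's scores."""
    result = {}
    for docid, score in set1.items():
        result[docid] = score * weight1
    for docid, score in set2.items():
        # Documents present in both sets get the sum of both weighted scores.
        result[docid] = result.get(docid, 0) + score * weight2
    return result


def weighted_intersection(set1, set2, weight1=1, weight2=1):
    """Keep only docids found in both mappings, summing the weighted scores."""
    return {
        docid: set1[docid] * weight1 + set2[docid] * weight2
        for docid in set1
        if docid in set2
    }
```

With both weights left at 1 this reduces to an ordinary score-summing union
or intersection.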
 
> The other is how it gets purely implemented.  Does the weight need to
> be known at indexing time, or can it be provided at search time ? My
> hunch is the weighing should be applied at search time, so your
> product could be modified to take as input the weights to apply to
> each index that is being searched ?

Could be done either way. Weighting at index time might be more
efficient, but would not allow different weights to be applied for
different queries. I doubt that query-time weighting would slow things
down at all since it is already being done (only the weight factors are
always 1). All of the set operations are implemented in C.
 
> Something like:
> 
> result = catalog(dc_fields={"query":"Some search string",
> "fields":["Title","Description"]})
> 
> could become:
> 
> result = catalog(dc_fields={"query":"Some search string",
> "fields":["Title","Description"], "weights":[5,1]})

Sure, or maybe:

result = catalog(dc_fields={"query":"Some search string",
                            "weighted_fields":{'title':5,
                                               'description':1}})

This might be slightly less error prone (otherwise you need to match up
the lists), if slightly less readable. :record marshalling for
weighted_fields could also be supported for queries from web forms.

Either spelling would work though and I'm open to input.
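
To make the trade-off concrete, here is a minimal sketch of how both spellings
could be reduced to one field -> weight mapping. The helper name
normalize_weights and its error handling are hypothetical; only the "fields",
"weights" and "weighted_fields" keys come from the examples above:

```python
def normalize_weights(spec):
    """Return a {field: weight} dict from either proposed query spelling."""
    if "weighted_fields" in spec:
        # Dict spelling: each field carries its own weight, nothing to align.
        return dict(spec["weighted_fields"])
    fields = spec["fields"]
    # List spelling: weights are optional and default to 1 for every field.
    weights = spec.get("weights", [1] * len(fields))
    if len(weights) != len(fields):
        # This is the mismatch the dict spelling cannot produce.
        raise ValueError("fields and weights must have the same length")
    return dict(zip(fields, weights))
```

The length check is the only part the list spelling needs that the dict
spelling does not.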
 
> Meaning apply a weight of 5 to Title, and 1 to Description.  Which I
> would in turn interpret as meaning Title is 5 times more important
> than Description (Not knowing any better right now).

Yes, scores for words found in the title would get multiplied by 5.
Scores for description would get multiplied by 1.
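
As an illustration of what that does to ranking, here is a small sketch with
made-up scores. The rank function is hypothetical, and it assumes a document's
per-field scores are simply added after weighting:

```python
def rank(title_scores, description_scores, title_weight=1, description_weight=1):
    """Return docids ordered by combined weighted score, best first."""
    combined = {}
    for docid, score in title_scores.items():
        combined[docid] = combined.get(docid, 0) + score * title_weight
    for docid, score in description_scores.items():
        combined[docid] = combined.get(docid, 0) + score * description_weight
    return sorted(combined, key=lambda docid: combined[docid], reverse=True)


# Made-up scores: doc_a matches mostly in the title, doc_b only in the description.
title_scores = {"doc_a": 2}
description_scores = {"doc_a": 1, "doc_b": 6}

unweighted = rank(title_scores, description_scores)     # doc_a: 3, doc_b: 6
weighted = rank(title_scores, description_scores, 5, 1) # doc_a: 11, doc_b: 6
```

Unweighted, doc_b comes first; with the title weighted 5 to 1, doc_a moves
ahead of it.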
 
> Personally I'm using the Okapi algorithm.  When I started
> investigating this, I came to the (admittedly uneducated) conclusion
> that to do proper, fast weighing, then the Okapi implementation would
> have to be modified to support this feature (Maybe it does already
> ??), which is over my head, especially with the okascore module being
> Python/C.  Doing it in python would mean doing a second pass over the
> results that have already been scored once, which is inefficient it
> seems, and computationally intensive (especially as I envision the fact
> that really really nice weighing algorithms would need to have all
> content in memory in order to do relational work between records).

I don't think the scoring algorithm would be affected by what you propose.
I'd need to dig in a little deeper to be sure though.
 
> Anyways, that's what I've been thinking about ... But the benefits of
> having such a beast seem really tantalizing, so I thought I'd ask
> anyways ... Besides maybe I'm way out to left field on this and it's
> easier than I make it out to be ?! :)

I think this is a very compelling addition to the product. I'm going to
look at implementing it this weekend.

Thanks for the idea!

-Casey


