[ZODB-Dev] Re: Incremental Search Engine (was: RE: [BTrees] considerable optimization potential in "intersection")

Dieter Maurer dieter at handshake.de
Mon Aug 9 13:39:31 EDT 2004


Casey Duncan wrote at 2004-8-9 10:08 -0400:
>On Sat, 7 Aug 2004 14:27:31 +0200
> ...
>>   The principle for an and query of subqueries q1, ... qn
>>   looks like:
>> 
>>     # determine a hit >= candidate
>>     candidate = q1.getCandidate(candidate) # hit >= candidate
>>     while True:
>>         foreach q:
>>             (accepted, candidate) = q.classify(candidate)
>>             if accepted == CANDIDATE: break # new candidate > old candidate
>>             if accepted == END: return END # no more hits
>>             # accepted == HIT; candidate unchanged
>>         if accepted == CANDIDATE: break # restart with new candidate
>>         # accepted == HIT; candidate is the next hit
>>         return candidate
>>   This means: the first query provides a candidate.
>
>The candidate is a single record or a set?

A single document id.

>>   All subqueries are asked whether they can accept this candidate.
>>   They may answer "END" (no more hits), "HIT" (candidate accepted),
>>   "CANDIDATE" (a new candidate larger than the former one).
>>   When all subqueries accept the candidate, it is the next hit.
>
>The principle of the algorithm makes sense to me, I think; however, I am
>confused by the pseudocode above. What does it mean for a candidate to
>be greater than another?

It has a larger document id (document ids are integers).

>Also, I'm confused by the third line from the end. How does breaking
>there restart? Is there an omitted outer loop?

You are right! The "break" should be a "continue".
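
For concreteness, here is a corrected sketch in runnable Python. It
assumes each subquery exposes "getCandidate(candidate)" and
"classify(candidate)" as in the pseudocode above, that "getCandidate"
returns END when it has no further hit, and that END, HIT and CANDIDATE
are distinct marker values; the function name "next_hit" is only for
illustration:

    END, HIT, CANDIDATE = 'END', 'HIT', 'CANDIDATE'

    def next_hit(queries, candidate):
        # return the smallest common hit >= candidate, or END
        candidate = queries[0].getCandidate(candidate)  # hit >= candidate
        if candidate == END:
            return END                  # the first subquery has no hit left
        while True:
            for q in queries:
                accepted, candidate = q.classify(candidate)
                if accepted == END:
                    return END          # no more hits at all
                if accepted == CANDIDATE:
                    break               # new candidate > old candidate
                # accepted == HIT: candidate unchanged, ask the next subquery
            else:
                # no subquery objected: candidate is the next hit
                return candidate
            # a subquery proposed a larger candidate; the "while" loop
            # continues and all subqueries are asked about it again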

>Also, the last line does a return; would that be a yield (i.e., would
>this be a generator?)

The "return" is correct.

The code above may be called from an outer incremental search
which does not request successive hits but skips large
hit ranges (by passing in an appropriate candidate).

We therefore want to re-enter the "while True" loop afresh with
whatever candidate the caller passes in, rather than resume a
generator where it left off.
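
Schematically (a hypothetical driver reusing "next_hit" from the sketch
above; the names and the cut-off logic are only for illustration):

    def first_n_hits(queries, n, start=0):
        # collect at most n hits; every call re-enters next_hit's
        # "while True" loop with a fresh candidate, and the caller may
        # just as well pass a much larger value than hit + 1 to skip
        # a whole range of hits
        hits = []
        candidate = start
        while len(hits) < n:
            hit = next_hit(queries, candidate)
            if hit == END:
                break
            hits.append(hit)
            candidate = hit + 1
        return hits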

> ...
>I do something along these lines in my pypes search engine (a ZODB
>embedded db project I am working on), it tries to opportunistically
>evaluate cheap portions of a query and put off the more expensive bits
>as much as possible (possibly forever if only a small portion of the
>result is actually used).

Very good!

If we could efficiently generate and classify candidates
based on a semantically useful ordering (e.g. if document ids
were allocated according to that ordering rather than randomly),
then the pseudocode above could be used in the same spirit
to produce the most relevant hits (with respect to this ordering)
incrementally and to stop as soon as sufficient hits have been found.

-- 
Dieter

