[ZODB-Dev] Re: Incremental Search Engine (was: RE: [BTrees] considerable optimization potential in "intersection")

Casey Duncan casey at zope.com
Mon Aug 9 10:08:43 EDT 2004


On Sat, 7 Aug 2004 14:27:31 +0200
Dieter Maurer <dieter at handshake.de> wrote:
[..]
>   The principle for an and query of subqueries q1, ... qn
>   looks like:
> 
> 	# determine a hit >= candidate
> 	candidate = q1.getCandidate(candidate) # hit >= candidate
>   	while True:
> 	    foreach q:
> 	        (accepted, candidate) = q.classify(candidate)
> 	        if accepted == CANDIDATE: break # new candidate > old candidate
> 	        if accepted == END: return END # no more hits
> 	        # accepted == HIT; candidate unchanged
> 	    if accepted == CANDIDATE: break # restart with new candidate
> 	    # accepted == HIT; candidate is the next hit
> 	    return candidate
>             
>   This means: the first query provides a candidate.

Is the candidate a single record or a set?

>   All subqueries are asked whether they can accept this candidate.
>   They may answer "END" (no more hits), "HIT" (candidate accepted),
>   "CANDIDATE" (a new candidate larger than the former one).
>   When all subqueries accept the candidate, it is the next hit.

The principle of the algorithm makes sense to me, I think, but I am
confused by the pseudocode above. What does it mean for one candidate
to be greater than another?

Also, I'm confused by the third line from the end. How does breaking
there restart? Is there an omitted outer loop?

Also, the last line does a return; should that be a yield (i.e., would
this be a generator)?
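For what it's worth, here is one possible reading of the pseudocode, with
the outer loop made explicit and the final return turned into a yield. This
is only my sketch, not the proposed implementation: it stands in a plain
sorted list plus bisect for a BTrees set (classification is still O(log n)),
and the SetQuery/and_query names are my own invention.

```python
from bisect import bisect_left

END, HIT, CANDIDATE = "END", "HIT", "CANDIDATE"

class SetQuery:
    """Hypothetical subquery over a sorted sequence of document ids."""
    def __init__(self, docids):
        self.docids = sorted(docids)

    def classify(self, candidate):
        # Find the smallest docid >= candidate (O(log n) via bisect;
        # a BTrees set could do the same kind of lookup).
        i = bisect_left(self.docids, candidate)
        if i == len(self.docids):
            return END, candidate          # no more hits in this subquery
        if self.docids[i] == candidate:
            return HIT, candidate          # candidate accepted unchanged
        return CANDIDATE, self.docids[i]   # new candidate > old candidate

def and_query(subqueries):
    """Yield docids present in every subquery, in increasing order."""
    if not subqueries:
        return
    candidate = 0  # assumes non-negative integer docids
    while True:
        restart = False
        for q in subqueries:
            accepted, candidate = q.classify(candidate)
            if accepted == END:
                return                     # some subquery is exhausted
            if accepted == CANDIDATE:
                restart = True             # re-check new candidate from the start
                break
        if not restart:
            yield candidate                # all subqueries accepted: a hit
            candidate += 1                 # continue after this hit
```

In this reading, the inner break abandons the current pass and the outer
while loop restarts the round of classifications with the new, larger
candidate; disjoint inputs terminate as soon as any subquery answers END.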

>   The worst case behaviour occurs when all subqueries have the
>   same result set. Otherwise, the "excess" elements in individual
>   results can be efficiently skipped.
> 
>   The worst case time is "n * len(shortest result) *
>   classification time".
> 
>   For a BTrees set "query", the classification time is logarithmic
>   with respect to the set size.
> 
> 
> I expect that this new incremental search engine
> will speed up most non-trivial searches which contain at least
> one "and" operator -- sometimes by orders of magnitude.
> Queries consisting only of or's will get slightly slower (by a
> logarithmic factor).

It will certainly be interesting to see.

> Sorting the sequence of subqueries based on expected classification
> time and specificity can significantly reduce the overall search time.
> This can make it feasible to allow non-index-based queries, e.g.
> queries that examine the real object, or complex neighborhood searches.

I do something along these lines in my pypes search engine (a
ZODB-embedded db project I am working on): it tries to opportunistically
evaluate the cheap portions of a query and put off the more expensive
bits as long as possible (possibly forever, if only a small portion of
the result is actually used).
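A sketch of that kind of ordering heuristic might look like the
following; the estimated_size and classify_cost estimators are
hypothetical stand-ins, not part of any real catalog or ZODB API:

```python
def order_subqueries(subqueries, estimated_size, classify_cost):
    """Classify small (selective) and cheap subqueries first; expensive,
    unindexed subqueries (e.g. ones that examine the real object) run
    last, when the candidate has already been narrowed down."""
    return sorted(subqueries, key=lambda q: (estimated_size(q), classify_cost(q)))
```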

-Casey


