Incremental Search Engine (was: RE: [ZODB-Dev] [BTrees] considerable optimization potential in"intersection")

Sat Aug 7 08:27:31 EDT 2004

Tim Peters wrote at 2004-8-6 14:36 -0400:
>[Dieter Maurer]
>> I will implement the intersection in Python and maybe later convert it
>> into C.
>
>Excellent!  And, however this turns out, that makes a lot more sense than
>doing it in the other order <wink>.  Let us know how it's going.

It will become an incremental search engine:

  The current search engine evaluates subqueries into
  sets and then uses set operations to combine these
  partial results into the final result set.

  As a consequence, large intermediate results can be constructed.
  A prominent example is the CMF CatalogTool subquery
  "effective <= now and now <= expires". The "effective <= now"
  part yields almost all objects.
  Other prominent examples are all types of range queries
  and pattern matching queries.

  Building large intermediate result sets reads lots of
  BTree derived instances.

  The new search engine will not build any intermediate
  sets but interleave index lookups and deliver
  one final hit at a time.

  The principle for an and query of subqueries q1, ... qn
  looks like:

	# determine a hit >= candidate
	candidate = q1.getCandidate(candidate) # hit >= candidate
  	while True:
	    foreach q:
	        (accepted, candidate) = q.classify(candidate)
	        if accepted == CANDIDATE: break # new candidate > old candidate
	        if accepted == END: return END # no more hits
	        # accepted == HIT; candidate unchanged
	    if accepted == CANDIDATE: break # restart with new candidate
	    # accepted == HIT; candidate is the next hit
	    return candidate

  This means: the first query provides a candidate.
  All subqueries are asked whether they can accept this candidate.
  They may answer "END" (no more hits), "HIT" (candidate accepted),
  "CANDIDATE" (a new candidate larger than the former one).
  When all subqueries accept the candidate, it is the next hit.

  The worst case behaviour occurs when all subqueries have the
  same result set. Otherwise, the "excess" elements in individual
  results can be efficiently skipped.

  The worst case time is "n * len(shortest result) * classificationtime".

  For a BTrees set "query", the classification time is logarithmic
  with respect to the set size.

I expect that this new incremental search engine
will speed up most non-trivial searches which contain at least
one "and" operator -- sometimes by orders of magnitude.
Queries consisting only of or's will get slightly slower (by a logarithmic
factor).

Sorting the sequence of subqueries based on expected classification
time and specificity can significantly reduce the overall search time.
This can make it feasible to allow non indexed based queries, e.g.
queries that examine the real object or complex neighborhood searches.

Of course, (arbitrary) intersection is a special case of searching.

-- 
Dieter