Incremental Search Engine (was: RE: [ZODB-Dev] [BTrees] considerable
optimization potential in"intersection")
Dieter Maurer
dieter at handshake.de
Sat Aug 7 08:27:31 EDT 2004
Tim Peters wrote at 2004-8-6 14:36 -0400:
>[Dieter Maurer]
>> I will implement the intersection in Python and maybe later convert it
>> into C.
>
>Excellent! And, however this turns out, that makes a lot more sense than
>doing it in the other order <wink>. Let us know how it's going.
It will become an incremental search engine:
The current search engine evaluates subqueries into
sets and then uses set operations to combine these
partial results into the final result set.
As a consequence, large intermediate results can be constructed.
A prominent example is the CMF CatalogTool subquery
"effective <= now and now <= expires". The "effective <= now"
part yields almost all objects.
Other prominent examples are all types of range queries
and pattern matching queries.
Building large intermediate result sets reads lots of
BTree derived instances.
The new search engine will not build any intermediate
sets but interleave index lookups and deliver
one final hit at a time.
The principle for an and query of subqueries q1, ... qn
looks like:
# determine a hit >= candidate
candidate = q1.getCandidate(candidate) # hit >= candidate
while True:
foreach q:
(accepted, candidate) = q.classify(candidate)
if accepted == CANDIDATE: break # new candidate > old candidate
if accepted == END: return END # no more hits
# accepted == HIT; candidate unchanged
if accepted == CANDIDATE: break # restart with new candidate
# accepted == HIT; candidate is the next hit
return candidate
This means: the first query provides a candidate.
All subqueries are asked whether they can accept this candidate.
They may answer "END" (no more hits), "HIT" (candidate accepted),
"CANDIDATE" (a new candidate larger than the former one).
When all subqueries accept the candidate, it is the next hit.
The worst case behaviour occurs when all subqueries have the
same result set. Otherwise, the "excess" elements in individual
results can be efficiently skipped.
The worst case time is "n * len(shortest result) * classificationtime".
For a BTrees set "query", the classification time is logarithmic
with respect to the set size.
I expect that this new incremental search engine
will speed up most non-trivial searches which contain at least
one "and" operator -- sometimes by orders of magnitude.
Queries consisting only of or's will get slightly slower (by a logarithmic
factor).
Sorting the sequence of subqueries based on expected classification
time and specificity can significantly reduce the overall search time.
This can make it feasible to allow non indexed based queries, e.g.
queries that examine the real object or complex neighborhood searches.
Of course, (arbitrary) intersection is a special case of searching.
--
Dieter
More information about the ZODB-Dev
mailing list