[ZODB-Dev] [BTrees] bad performance of intersection() over ZEO

Wed Aug 4 13:55:42 EDT 2004

[Andreas Jung]
> this morning I discovered that a statement
>
> found = intersection(ids, idx) took up to 80 seconds.
>
> 'ids' -- OOTreeset (with up to 1000 items)
> 'idx' -- OOBTree (with up to 180K items, loaded through ZEO).
>
> I assume that the worse performance comes from loading the whole OOBTree
> through ZEO.

Probably, yes.

> Replacing the code with
>
> found = OOSet()
> for id in ids.keys():
>    if idx.has_key(id):
>       found.insert(id)
>
> reduced the processing time to less than one second.

Keep doing that, then <wink>.

> Is this a flaw in the implementation of intersection() or just
> broken-by-design?

I'd say neither.  It's due to a lack of optimization for this particular
case.  There are really no optimizations now:  union, weighted union,
intersection, weighted intersection, and set difference are all done by a
single routine, set_operation() in SetOpTemplate.c.  While the code is quite
involved already, it's conceptually simple:  march over both inputs, one at
a time each, as if doing a two-sorted-list merge, and accumulate elements
from neither, one, or both inputs (depending on the operation) as you go
along.  The worst-case time requirement is O(len(input1) + len(input2)), and
the best-case time is often no better (for example, it's never better when
doing a union).

But if, for example, you were to intersect the sets

    {1, 2, 3, ..., 10000000000}

    {3}

it would reach the end of the second input after seeing just the first 3
elements of the first input, and it would stop then.

So intersection speed is very dependent on the input set contents, as well
as on input set sizes.

There are more elaborate approaches that *could* get a lot more speed in
many cases (including yours, intersection of sets of very different sizes,
where the largest element of the smaller input lies "near the end" of the
larger input).  This is a good introduction to them:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

Python 2.3's adaptive merge sort uses some of those techniques to very good
effect.  They're hard to code and get right, though, and harder for ZODB's
variety of collection types than for Python's dead-simple lists.  Trying to
do so anyway has been on my todo list for years, but so far down the
priority stack I doubt I'll ever get to it.

In the meantime, optimization (when needed) is up to you, based on
exploiting your knowledge of the inputs.