[ZODB-Dev] Re: [Bug] Another FileStorage packing bug

Dieter Maurer dieter at handshake.de
Thu Jan 15 13:05:51 EST 2004


Hi Jeremy,

Jeremy Hylton wrote at 2004-1-14 17:49 -0500:
>On Tue, 2004-01-13 at 05:51, Dieter Maurer wrote:
>> Instead it was extremely bad runtime behaviour in connection
>> with "_resolve_backpointer" and "_data_find".
>> With the ZODB 3.1 implementation (at least) it is easy
>> to get quadratic runtime behaviour with respect to the number
>> of objects contained in a transaction.
>
>If I follow correctly, the problem is that for each backpointer in a
>transaction, you try to find the transaction for that backpointer.  You
>always do a linear search through the file, starting at the end, to find
>the transaction for that backpointer.  Even if there are N objects all
>in the same transaction, you do the search N times.

You are right: there are two potentially expensive operations:

  *  searching the transactions

  *  searching the object record within the transaction

My patch addresses only the second one.
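To illustrate the quadratic behaviour (this is a toy model, not actual FileStorage code): resolving N backpointers that all land in the same transaction scans that transaction's records N times.

```python
def find_record_linear(records, oid):
    """Linear scan, analogous to _data_find walking a transaction's records."""
    for pos, record_oid in enumerate(records):
        if record_oid == oid:
            return pos
    return None

def resolve_all_naive(records, oids):
    # One full scan per lookup: N lookups * O(N) scan = O(N**2) overall.
    return [find_record_linear(records, oid) for oid in oids]

records = ["oid-%d" % i for i in range(200)]
assert resolve_all_naive(records, records) == list(range(200))
```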

> ...
>I'm not sure I understand what your patch is doing.  Could you explain
>the strategy?  I would think a chief problem is _txn_find(), but it
>looks like you're not changing that.

The patch maintains, in a cache, the "oid -> pos" mapping for each
transaction in which objects for backpointers have been looked up.
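The idea can be sketched roughly as follows (names and structure are illustrative, not the patch's actual code): the first time a transaction is consulted for a backpointer, index all of its object records into an "oid -> pos" mapping, so later lookups in the same transaction become dictionary hits instead of fresh linear scans.

```python
class DataFindCache:
    """Hypothetical per-transaction record index for _data_find lookups."""

    def __init__(self):
        self._by_txn = {}  # tid -> {oid: pos}

    def data_find(self, tid, oid, read_records):
        index = self._by_txn.get(tid)
        if index is None:
            # One linear pass builds the whole index for this transaction;
            # read_records(tid) yields (oid, pos) pairs.
            index = {rec_oid: pos for rec_oid, pos in read_records(tid)}
            self._by_txn[tid] = index
        return index.get(oid)
```

With this, N lookups in one transaction cost one scan plus N dictionary hits instead of N scans.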


>Or is the problem that the transactions themselves contain many objects
>and the search in _data_find() for the record matching a particular oid
>is expensive.

Right.

We may add another cache for the "transaction -> pos" mapping
to make "_txn_find" faster.
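A sketch of what such a second cache might look like (again hypothetical, assuming a fallback linear scan routine): remember where each transaction starts in the file, so the search from the end of the file happens at most once per transaction.

```python
class TxnPosCache:
    """Hypothetical "tid -> file position" cache for _txn_find lookups."""

    def __init__(self, scan_for_txn):
        self._pos = {}             # tid -> file position
        self._scan = scan_for_txn  # fallback linear scan through the file

    def txn_find(self, tid):
        pos = self._pos.get(tid)
        if pos is None:
            pos = self._scan(tid)  # expensive only on the first lookup
            self._pos[tid] = pos
        return pos
```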

>Either way a cache seems like a decent solution.  I wonder, though, if
>we should rethink the algorithms for these cases.  For pack and undo,
>we're pretty much doing a sequential operation on all the objects.  So
>it might be clearer to rework the APIs in some way to exploit the
>regular access patterns.  On the other hand, a cache works without
>having to rethink the APIs.

I think you refactored the old packing code (which used linear
copying, but under hairy and sometimes wrong assumptions)
in order to make the implementation more stable.

Thus, it might be safer to use a cache than to go back
to a linear copying algorithm.

-- 
Dieter
