[ZODB-Dev] session problems

Sat Dec 24 21:13:27 EST 2005

[Florent Guillaume]
> I've been debugging session problems for two days, I feel it's time to
> write down what I've observed and ask for other eyes to look at it (Chris
> McDonough has been working on this too). This is all on Zope 2.9 trunk
> BTW (ZODB 3.6.0b5 and Zope 2.9's tempstorage) with python 2.4.2.

Most people at Zope Corp are off the rest of the year, so don't expect much.
I don't know anything about tempstorage myself, but since I'm on vacation
too that doesn't much matter ;-)

> What I observed was an unnatural number of repeated ConflictError (by
> that, I mean "write" conflicts) followed by more and more
> ReadConflictErrors as soon as you go beyond the time
> CONFLICT_CACHE_MAXAGE of TemporaryStorage.
>
> To simplify debugging, I've boosted that constant and I only debug the
> write conflict errors.
>
> The first write conflict happens when a BTree can't resolve a conflict.
> The transaction is then aborted.
>
> Here, what should happen is what happens correctly for FileStorage: the
> connection's _flush_invalidations should get called, and it should reset
> the _txn_time of the connection to None so that the modified oids
> (including the BTree's), when invalidated, reset the _txn_time to their
> serial. Then, on the next conflict, _setstate_noncurrent calls
> loadBefore with that serial.
>
> But apparently the _flush_invalidations() of the connection is never
> called. So _txn_time is never bumped into the future (which in turn means
> the next write conflict will try to load exactly the same serials as
> before and fail again, etc.).
>
> This seems to happen because:
>
> 1. the connection has _synch set to True: it has registered itself as a
> synchronizer, and expects its afterCompletion to be called when (among
> others) the transaction is aborted, and afterCompletion calls
> _flush_invalidations,
>
> 2. the synchronizer (the connection itself) has been lost from the
> transaction's _synchronizers WeakSet for some reason (garbage collected, I
> guess). It was there in earlier transactions, but it's not there at the
> time it's needed.
>
> If someone can make sense of this...
>
> Actually I don't know why the connection (=synchronizer) could be gone
> from the transaction's _synchronizers WeakSet but still be in the DB's
> connection pool WeakSet. I guess here lies the problem.

That's a great question.  It doesn't seem possible that it's gc (unless
there's a relevant deep weakref gc bug remaining in Python, which I think is
unlikely).

A Transaction never removes anything from its ._synchronizers set.

However, Transaction.__init__() gets its ._synchronizers set from the
transaction manager that creates the transaction, and the
TransactionManager._synchs set is deliberately mutable:  a Transaction
"sees" (in its ._synchronizers set) any changes made to the corresponding
TransactionManager._synchs set (these "two" sets are the same object, just
with different names).
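
To make the aliasing concrete, here's a toy model.  It is not the real
transaction package:  ToySynch, ToyTransaction and ToyTransactionManager
are made-up stand-ins, and the stdlib weakref.WeakSet is assumed in place of
whatever weak set the real code uses.  It only shows that a synchronizer
registered _after_ the transaction was created is still visible through the
transaction, because both names point at one set object.

```python
import weakref


class ToySynch:
    """Hypothetical stand-in for a synchronizer such as a Connection."""

    def __init__(self, name):
        self.name = name


class ToyTransaction:
    def __init__(self, synchronizers):
        # Keep a reference to the manager's set, not a copy of it.
        self._synchronizers = synchronizers


class ToyTransactionManager:
    def __init__(self):
        self._synchs = weakref.WeakSet()

    def registerSynch(self, synch):
        self._synchs.add(synch)

    def begin(self):
        # The new transaction is handed the very same set object.
        return ToyTransaction(self._synchs)


tm = ToyTransactionManager()
txn = tm.begin()

cn = ToySynch("cn")
tm.registerSynch(cn)  # registered after txn was created

shared = txn._synchronizers is tm._synchs
seen_after_register = cn in txn._synchronizers
```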

While a transaction manager never removes a synchronizer from its ._synchs
set on its own initiative, anyone can call
ITransactionManager.unregisterSynch(s) to force removal of synchronizer `s`.
Then `s` will vanish both from TransactionManager._synchs and
Transaction._synchronizers (again, they're really the same set object).
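
A toy sketch of the consequence, under loud assumptions:  the classes below
are made up for illustration, the `flushed` counter merely stands in for
_flush_invalidations() being called, and the toy close() does nothing except
unregister.  Once the synchronizer is unregistered, a transaction that was
already in progress no longer notifies it when it aborts.

```python
import weakref


class ToyTransaction:
    def __init__(self, synchronizers):
        self._synchronizers = synchronizers  # same object as the manager's set

    def abort(self):
        # Notify whoever is registered at the time of the abort.
        for synch in list(self._synchronizers):
            synch.afterCompletion(self)


class ToyTransactionManager:
    def __init__(self):
        self._synchs = weakref.WeakSet()

    def registerSynch(self, synch):
        self._synchs.add(synch)

    def unregisterSynch(self, synch):
        self._synchs.discard(synch)

    def begin(self):
        return ToyTransaction(self._synchs)


class ToyConnection:
    def __init__(self, tm):
        self._tm = tm
        self.flushed = 0  # counts stand-in _flush_invalidations() calls
        tm.registerSynch(self)

    def afterCompletion(self, txn):
        self.flushed += 1

    def close(self):
        # Hypothetical: the only thing this toy close() does.
        self._tm.unregisterSynch(self)


tm = ToyTransactionManager()
cn = ToyConnection(tm)

tm.begin().abort()
flushed_before_close = cn.flushed

txn = tm.begin()
cn.close()  # unregisters while txn is still in progress
still_registered = cn in txn._synchronizers
txn.abort()  # cn is not notified this time
flushed_after_close = cn.flushed
```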

In ZODB, the only caller of unregisterSynch() is Connection.close().

So that's the plain obvious way for this to happen:  someone called
cn.close() on the Connection `cn` in question.  Are you sure that's not all
there is to it?  Closing the connection would remove `cn` from the
transaction manager's ._synchs, and from the transaction's ._synchronizers.
It would _not_ remove `cn` from the DB's connection pool's .all, though.
There are only two ways a Connection `cn` can ever get out of .all:

1. There are no strong references to `cn` remaining, so gc reclaims `cn`.

or

2. `cn` isn't currently in use, hasn't been in use for so long that
   it's bubbled to the front of the .available queue, and enough
   other Connections get closed that len(.available) exceeds pool_size.
   Then the "oldest" excess available connections are explicitly removed
   from both .available and .all.
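
Both exits from .all can be mimicked with a toy pool.  This is a sketch, not
ZODB's connection pool:  ToyPool, new() and release() are invented names,
.all is assumed to be a stdlib weakref.WeakSet, and .available is assumed to
be a plain list holding strong references, oldest first.

```python
import gc
import weakref


class ToyConnection:
    def __init__(self, name):
        self.name = name


class ToyPool:
    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.all = weakref.WeakSet()  # weak refs to every live connection
        self.available = []           # closed connections, oldest first

    def new(self, name):
        cn = ToyConnection(name)
        self.all.add(cn)
        return cn

    def release(self, cn):
        """Hypothetical hook run when cn is closed."""
        self.available.append(cn)
        # Explicitly evict the oldest excess available connections.
        while len(self.available) > self.pool_size:
            oldest = self.available.pop(0)
            self.all.discard(oldest)


pool = ToyPool(pool_size=1)
a = pool.new("a")
b = pool.new("b")
c = pool.new("c")

# Way 1: no strong references remain, so gc reclaims the connection.
del c
gc.collect()
names_after_gc = sorted(cn.name for cn in pool.all)

# Way 2: closing alone leaves a connection in .all ...
pool.release(a)
a_in_all_after_close = a in pool.all

# ... until enough others are closed that it becomes excess.
pool.release(b)
a_in_all_after_eviction = a in pool.all
names_at_end = sorted(cn.name for cn in pool.all)
```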

> Also, I don't know why we don't observe this for FileStorage, maybe
> something has a hard reference on it somewhere?

It doesn't sound likely to me that hard references are relevant -- but then
I really don't know anything about tempstorage.

HTH, and good luck in either case ;-)