[ZODB-Dev] Race condition in basestorage commit locks?

Wed Oct 5 13:26:08 EDT 2005

[Christian Theune]
> this is nasty. We detected a couple of hangs with a Zope 2.7.7 final and
> (using the DeadlockDebugger) we found this situation in every of the four
> threads: (See attached file for complete listing of all threads)

And there are 4 threads each stuck at self._commit_lock_acquire() in
BaseStorage.BaseStorage.tpc_begin().

> ...
> We then looked at the locks in BaseStorage, and what I found is that in
> tpc_finish, the commit_lock will be freed in any case (using a finally
> clause) whereas in tpc_abort, there might be a race condition that does
> not free the commit_lock. (the finally clause only covers a second,
> different lock).

I wouldn't call this "a race", because nothing here appears to be
timing-dependent.  This is the code:

    def tpc_abort(self, transaction):
        self._lock_acquire()
        try:
            if transaction is not self._transaction:
                return
            self._abort()
            self._clear_temp()
            self._transaction = None
            self._commit_lock_release()
        finally:
            self._lock_release()

That will release the commit lock if and only if neither of:

            self._abort()
            self._clear_temp()

raises an exception.  The code does seem to implicitly assume that neither
of those _will_ raise an exception, and I agree the commit lock release
belongs in the `finally` clause instead.

I'll change that, but doubt it will make a difference to you:  if either of
those did raise an exception, I expect you would have seen a traceback.
This code doesn't _suppress_ exceptions, right?  I'd also look at the source
code for whatever storages you're using, to see whether/how they override
_abort() and/or _clear_temp().  If any of those _can_ raise exceptions, then
(a) they probably ought to be changed so that they cannot; but, (b) that
indeed could provoke BaseStorage.tpc_abort() into leaving the commit lock
acquired.

> Additionally, we are using Ape running on a postgres backend, so this
> might trigger some unusual side effects, maybe this possible race
> condition.

It's also possible for storages (deriving from BaseStorage) to do their own
unique things with the locks BaseStorage created.  That is, _just_ staring
at BaseStorage doesn't tell us what the other stuff you're using may be
doing with these locks.  For example, FileStorage mucks with the commit lock
too, to give packing a way to block commits during the final copy.

> (If someone has another suspicion where this hang might come from, I'm
> all your's to listen)

As above, FileStorage can block commits "for a long time" if a pack is going
on.  That's not really a hang, but can _appear_ to be a hang until packing
completes.  There's no evidence in your msg that packing was going on,
though.

Another possibility is that some path thru the code calls tpc_begin on a
storage but never calls tpc_finish or tpc_abort on that storage.  Then the
storage's commit lock will remain acquired forever.  For example, did the
following msg show up in your logs?

            LOG('ZODB', PANIC,
                "A storage error occurred in the last phase of a "
                "two-phase commit.  This shouldn\'t happen. "
                "The application will not be allowed to commit "
                "until the site/storage is reset by a restart. ",
                error=sys.exc_info())

Or this one?

                LOG('ZODB', ERROR,
                    "A storage error occured during object abort. This "
                    "shouldn't happen. ", error=sys.exc_info())