[ZODB-Dev] Race condition in basestorage commit locks?

Wed Oct 5 14:17:57 EDT 2005

[Tim]
>> The code does seem to implicitly assume that neither of those _will_
>> raise an exception, and I agree the commit lock release belongs in
>> the `finally` clause instead.
>>
>> I'll change that, but doubt it will make a difference to you:  if either
>> of those did raise an exception, I expect you would have seen a
>> traceback.

[Christian Theune]
> Hmm. I don't know. At least the lock will have been gone away and the
> system would have continued.

But you don't _know_ that, right?  Transaction.py calls tpc_abort only from
here:

    for j in jars:
        try:
            j.tpc_abort(self) # This should never fail
        except:
            LOG('ZODB', ERROR,
                "A storage error occured during object abort. This "
                "shouldn't happen. ", error=sys.exc_info())

As you can see, if tpc_abort() raises an exception from this call site, you
must get the log message shown.  You say later that you don't get such log
messages, therefore tpc_abort() is not raising exceptions from this call
site; and if tpc_abort() is not raising exceptions, then from the code we
looked at last time it's clear that the commit lock does get released.

> It's just plainly stuck in that moment.

That much you know.

> ...
> The storage is ape and it simply loops over all kinds of "connections"
> that it maintains on it's own (which i suppose to go to postgres in our
> case, dunno if ape has anything additional around there). Ape does not
> check for any errors in its _abort() though.

Anyone else know about APE (I don't)?

...

>> Another possibility is that some path thru the code calls tpc_begin on a
>> storage but never calls tpc_finish or tpc_abort on that storage.  Then
>> the storage's commit lock will remain acquired forever.  For example,
>> did the following msg show up in your logs?
>>
>>         LOG('ZODB', PANIC,
>>             "A storage error occurred in the last phase of a "
>>             "two-phase commit.  This shouldn\'t happen. "
>>             "The application will not be allowed to commit "
>>             "until the site/storage is reset by a restart. ",
>>             error=sys.exc_info())
>>
>> Or this one?
>>
>>             LOG('ZODB', ERROR,
>>                 "A storage error occured during object abort. This "
>>                 "shouldn't happen. ", error=sys.exc_info())

> Actually, I found those in the logs, but they were from two month ago.
> Both of them were PANIC messages.

I confess I'm looking at current Zope 2.7-branch code here, but according to
CVS the second message was logged at ERROR level at least back through Zope
2.7.3 too.  So if you really did see the second message logged at PANIC
level, some code outside of ZODB created it.  That would be a good place to
look ;-)

> I'll try to dig through Ape tomorrow and will try moving the release in
> the final block. Unfortunately I have to rely on a very patient customer,
> as we can only provoke this on a live system.
>
> I suspect some of the problems beeing the amount of external systems
> involved with a transaction here. At least it's ZODB, PostgreSQL and
> MySQL within a single transaction. And Ape.

Don't know enough to be sure, but you've seen that the commit-lock code _in_
ZODB is very simple:  tpc_begin acquires it, tpc_{abort,finish} releases it,
and that's it.  I agree it's an error that tpc_abort may not release it if
other code in tpc_abort raises an exception, but in that case you should
also see the log message shown above.