[ZODB-Dev] Storm/ZEO deadlocks (was Re: [Zope-dev] [announce] NEO 1.0 - scalable and redundant storage for ZODB)

Marius Gedminas marius at gedmin.as
Fri Aug 31 11:11:18 UTC 2012


On Thu, Aug 30, 2012 at 11:19:22AM -0600, Shane Hathaway wrote:
> On 08/30/2012 10:14 AM, Marius Gedminas wrote:
> >Here's the code to reproduce it: http://pastie.org/4617132

Updated version with more explicit logging and fewer unnecessary things:
http://pastie.org/4630898

And here's the output: http://pastie.org/4631136

> >The deadlock happens in tpc_begin() in both threads, which is the first
> >phase, AFAIU.
> >
> >AFAICS Thread #2 first performs tpc_begin() for ClientStorage and takes
> >the ZEO commit lock.  Then it enters tpc_begin() for Storm's
> >StoreDataManager and blocks waiting for a response from PostgreSQL --
> >which is delayed because the PostgreSQL server is waiting to see if
> >the other thread, Thread #1, will commit or abort _its_ transaction, which
> >is conflicting with the one from Thread #2.
> >
> >Meanwhile Thread #1 is blocked in ZODB's tpc_begin(), trying to acquire the
> >ZEO commit lock held by Thread #2.

It looks like I mixed up the thread numbers when I was writing this up
last night, i.e. in the above Thread #2 is the one running work1(), and
Thread #1 is the one running work2().  Sorry about that.  In my
defense, that was the order in which they were printed out, which was
the iteration order of the sys._current_frames() dict.
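
For reference, a per-thread stack dump of that kind can be produced roughly
as follows.  This is a minimal sketch, not the code from the pastie above;
format_thread_stacks is a name invented here for illustration.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a string with the current stack of every running thread."""
    # Map thread ids to names so the dump is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    # sys._current_frames() maps thread id -> topmost frame; the threads
    # come out in whatever order the dict yields them.
    for thread_id, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):" % (thread_id, names.get(thread_id, "?")))
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)


if __name__ == "__main__":
    print(format_thread_stacks())
```

Because the threads are listed in dict iteration order, the numbering in
such a dump need not match the order in which the threads were started.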

> 
> So thread 1 acquires in this order:
> 
> 1. PostgreSQL
> 2. ZEO
> 
> Thread 2 acquires in this order:
> 
> 1. ZEO
> 2. PostgreSQL
> 
> SQL databases handle deadlocks by detecting and automatically
> rolling back transactions, while the "transaction" package expects
> all data managers to completely avoid deadlocks using the sortKey
> method.
> 
> I haven't looked at the code, but I imagine Storm's StoreDataManager
> implements IDataManager.  I wonder if StoreDataManager provides a
> consistent sortKey.  The sortKey method must return a string (not an
> integer or other object) that is consistent yet different from all
> other participating data managers.
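
The opposite acquisition orders quoted above can be reproduced in miniature
with two plain locks.  This is only a sketch under stated assumptions: the
two threading.Lock objects stand in for the PostgreSQL row lock and the ZEO
commit lock, simulate_opposite_order is a made-up name, and the timeout is
there only so the demonstration terminates instead of hanging the way the
real case does.

```python
import threading


def simulate_opposite_order(timeout=0.2):
    """Two threads take two locks in opposite order.

    Returns a dict saying whether each thread managed to get its second lock.
    """
    postgresql = threading.Lock()  # stand-in for the PostgreSQL row lock
    zeo = threading.Lock()         # stand-in for the ZEO commit lock
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            # Wait until each thread is holding its first lock.
            both_hold_first.wait()
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            # Keep holding the first lock until both threads have tried.
            both_tried_second.wait()
            if acquired:
                second.release()

    threads = [
        threading.Thread(target=worker, args=("thread 1", postgresql, zeo)),
        threading.Thread(target=worker, args=("thread 2", zeo, postgresql)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


if __name__ == "__main__":
    print(simulate_opposite_order())
```

Neither thread gets its second lock.  Without the timeout both would wait
forever, which is the situation described above.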

Thread 1 (i.e. work2) acquires the PostgreSQL lock by issuing that
DELETE statement, before the transaction commit has started on either
thread.  The second thread (work1), on the other hand, tries to lock
PostgreSQL with the UPDATE statement issued by store.flush(), which runs
inside tpc_begin(), by which point it is already holding the ZEO lock.
So maybe it would be enough to make Storm's StoreDataManager always sort
before ZEO, so that we always take PostgreSQL locks before ZEO locks.
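
To illustrate the idea: as far as I understand it, the transaction package
sorts the joined data managers by their sortKey() strings before it calls
tpc_begin() on them, so a key that compares lower goes first.  The sketch
below uses a stub class and invented key values; it does not show what
Storm or ZEO actually return from sortKey().

```python
class StubDataManager:
    """Hypothetical stand-in for a data manager; only sortKey() matters here."""

    def __init__(self, name, sort_key):
        self.name = name
        self._sort_key = sort_key

    def sortKey(self):
        # Must be a string, stable across calls, and distinct per manager.
        return self._sort_key


def tpc_begin_order(managers):
    """Return manager names in ascending sortKey() order."""
    return [dm.name for dm in sorted(managers, key=lambda dm: dm.sortKey())]


# Invented key values: "!" sorts before ASCII letters and digits, so the
# Storm stub ends up ahead of the ZEO stub.
zeo = StubDataManager("ZEO", "zeo-example-key")
storm = StubDataManager("Storm", "!storm-example-key")
print(tpc_begin_order([zeo, storm]))
```

With Storm first, work1 would block on PostgreSQL before it has taken the
ZEO commit lock, leaving the other thread free to finish its commit.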

For the record, adding a store.flush() before transaction.commit()
inside work1() makes this particular instance of the deadlock go away
(and one of the transactions fail with a TransactionRollbackError: could
not serialize access due to concurrent update).
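
In code, the workaround amounts to something like the following.  It is a
sketch only: commit_with_early_flush is a hypothetical helper, and the store
and transaction objects are passed in so that the example does not need a
live database.

```python
def commit_with_early_flush(store, txn):
    """Flush pending Storm changes, then commit.

    `store` is expected to be a Storm Store and `txn` the `transaction`
    module (or a transaction manager).
    """
    # Send the pending UPDATE now, so that any PostgreSQL lock is taken
    # (or the serialization error is raised) before tpc_begin() runs and
    # before the ZEO commit lock is acquired.
    store.flush()
    txn.commit()
```

Inside work1() this would replace the bare transaction.commit() call, e.g.
commit_with_early_flush(store, transaction).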

Marius Gedminas
-- 
The worst thing about going out is that you're not in your house.
        -- Kimiko Ross