[ZODB-Dev] ZODB Deadlock issue python frame trace

Jeremy Hylton jeremy@digicool.com
Fri, 18 May 2001 12:23:55 -0400 (EDT)


John,

I've got a(nother) potential fix for the problem you discovered.  I'm
much happier with this fix, because I think it solves the problem
pretty much the right way.

The cause of the bug (can't remember if I explained it on the list or
in private email) is that tpc_finish() acquires the storage and db
locks in a different order than DB.open().  As a result, if each
starts at the same time and gets one of the two locks it needs, the
system will be deadlocked.

For a deadlock problem like this, there seem to be two a couple of
simple strategies we can use to fix it.  One is to see if the locks
cover more data than they need to; if they do, we can use two
different locks that each cover less data.  Unfortunately, that
approach doesn't work here.  The DB lock that is held during open()
and during the invalidation phase of tpc_finish() are protecting
exactly the same data structures.

The other simple strategy is to enforce a consistent locking order.
If a thread is going to hold the DB lock and the storage lock, it MUST
acquire the DB lock first.  That's what my proposed solution does.

The DB object gets methods called begin_invalidation() and
finish_invalidation() that acquire and release the DB lock
respectively.  Before the Connection calls tpc_finish() on the
storage, it calls begin_invalidation().  This guarantees that the DB
acquired before the storage lock.

When the invalidation phase is over, the Connection calls
end_invalidation() to release the DB lock.  This is an optimization.
It could wait until tpc_finish() returns, but we know that the DB will
not be used again for the rest of the tpc_finish() and tpc_finish()
could take a long time.

There is a patch below.  It passes that FileStorage test suite, but
that doesn't really test the locking behavior.  Does it solve your
problem?

Jeremy

Index: Connection.py
===================================================================
RCS file: /cvs-repository/Zope2/lib/python/ZODB/Connection.py,v
retrieving revision 1.53
diff -c -c -r1.53 Connection.py
*** Connection.py	2001/05/17 18:35:10	1.53
--- Connection.py	2001/05/18 16:06:16
***************
*** 683,700 ****
      def tpc_finish(self, transaction):
  
          # It's important that the storage call the function we pass
!         # (self.tpc_finish_) while it still has it's lock.  We don't
!         # want another thread to be able to read any updated data
!         # until we've had a chance to send an invalidation message to
!         # all of the other connections!
  
          if self._tmp is not None:
              # Commiting a subtransaction!
              # There is no need to invalidate anything.
!             self._storage.tpc_finish(transaction, self._invalidate_sub)
              self._storage._creating[:0]=self._creating
              del self._creating[:]
          else:
              self._storage.tpc_finish(transaction,
                                       self._invalidate_invalidating)
  
--- 656,674 ----
      def tpc_finish(self, transaction):
  
          # It's important that the storage call the function we pass
!         # (self._invalidate_invalidating) while it still has it's
!         # lock.  We don't want another thread to be able to read any
!         # updated data until we've had a chance to send an
!         # invalidation message to all of the other connections!
  
          if self._tmp is not None:
              # Commiting a subtransaction!
              # There is no need to invalidate anything.
!             self._storage.tpc_finish(transaction)
              self._storage._creating[:0]=self._creating
              del self._creating[:]
          else:
+             self._db.begin_invalidation()
              self._storage.tpc_finish(transaction,
                                       self._invalidate_invalidating)
  
***************
*** 703,715 ****
  
      def _invalidate_invalidating(self):
          invalidate=self._db.invalidate
!         for oid in self._invalidating: invalidate(oid, self)
! 
!     def _invalidate_sub(self):
!         # There's no point in invalidating any objects in a subtransaction
!         #
!         # Because we may ultimately abort the containing transaction.
!         pass
  
      def sync(self):
          get_transaction().abort()
--- 677,685 ----
  
      def _invalidate_invalidating(self):
          invalidate=self._db.invalidate
!         for oid in self._invalidating:
!             invalidate(oid, self)
!         self._db.finish_invalidation()
  
      def sync(self):
          get_transaction().abort()

Index: DB.py
===================================================================
RCS file: /cvs-repository/Zope2/lib/python/ZODB/DB.py,v
retrieving revision 1.29
diff -c -c -r1.29 DB.py
*** DB.py	2001/05/17 18:35:10	1.29
--- DB.py	2001/05/18 16:07:08
***************
*** 311,316 ****
--- 311,326 ----
      def importFile(self, file):
          raise 'Not yet implemented'
  
+     def begin_invalidation(self):
+         # Must be called before first call to invalidate and before
+         # the storage lock is held.
+         self._a()
+ 
+     def finish_invalidation(self):
+         # Must be called after begin_invalidation() and after final
+         # invalidate() call.
+         self._r()
+ 
      def invalidate(self, oid, connection=None, version='',
                     rc=sys.getrefcount):
          """Invalidate references to a given oid.
***************
*** 320,357 ****
          passed in to prevent useless (but harmless) messages to the
          connection.
          """
!         if connection is not None: version=connection._version
!         self._a()
!         try:
  
!             # Update modified in version cache
!             h=hash(oid)%131
!             cache=self._miv_cache
!             o=cache.get(h, None)
!             if o and o[0]==oid: del cache[h]
! 
!             # Notify connections
!             pools,pooll=self._pools
!             for pool, allocated in pooll:
!                 for cc in allocated:
                      if (cc is not connection and
                          (not version or cc._version==version)):
-                         if rc(cc) <= 3:
-                             cc.close()
                          cc.invalidate(oid)
! 
!             temps=self._temps
!             if temps:
!                 t=[]
!                 for cc in temps:
!                     if rc(cc) > 3:
!                         if (cc is not connection and
!                             (not version or cc._version==version)):
!                             cc.invalidate(oid)
!                         t.append(cc)
!                     else: cc.close()
!                 self._temps=t
!         finally: self._r()
  
      def invalidateMany(self, oids=None, version=''):
          if oids is None: self.invalidate(None, version=version)
--- 330,362 ----
          passed in to prevent useless (but harmless) messages to the
          connection.
          """
!         if connection is not None:
!             version=connection._version
!         # Update modified in version cache
!         h=hash(oid)%131
!         o=self._miv_cache.get(h, None)
!         if o is not None and o[0]==oid: del self._miv_cache[h]
  
!         # Notify connections
!         for pool, allocated in self._pools[1]:
!             for cc in allocated:
!                 if (cc is not connection and
!                     (not version or cc._version==version)):
!                     if rc(cc) <= 3:
!                         cc.close()
!                     cc.invalidate(oid)
! 
!         temps=self._temps
!         if temps:
!             t=[]
!             for cc in temps:
!                 if rc(cc) > 3:
                      if (cc is not connection and
                          (not version or cc._version==version)):
                          cc.invalidate(oid)
!                     t.append(cc)
!                 else: cc.close()
!             self._temps=t
  
      def invalidateMany(self, oids=None, version=''):
          if oids is None: self.invalidate(None, version=version)