[ZODB-Dev] A certain code path seems to be blocking everything

Thu Feb 14 21:24:51 UTC 2013

I've got a weird bug with my server and I'm wondering if anyone could
provide some insight
into it.

The general idea of the server is that I have 8 or so clients constantly
pinging it for information.
Each ping usually only takes up to 2 seconds to process.

Essentially, everything runs fine, until a certain code path is executed -
basically, one of
the clients causes the server to try to give it some other information than
usual. When this
client is turned on and starts pinging the server, suddenly all those
requests that only took
0.5-2 seconds now take 10, 20, or even 30 seconds.

I'm using paster and using its thread-tracker tool I can see what all the
threads are doing at
any given point. Usually there's anywhere from 0 to 4 requests going on,
each having taken
less than 1 second up to that point. When the 'special client' is turned
on, there's more like
10 requests each having taken 6 seconds or more up to that point. There
aren't more
requests, it's just that they don't clear out quickly enough. The tail end
of the tracebacks
are one of two things:

  File "/home/tsa/env/lib/python2.6/site-packages/transaction/_manager.py",
line 89, in commit
    return self.get().commit()
  File "/home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py",
line 329, in commit
    self._commitResources()
  File "/home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py",
line 441, in _commitResources
    rm.tpc_begin(self)
  File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py",
line 547, in tpc_begin
    self._normal_storage.tpc_begin(transaction)
  File "/home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py",
line 1118, in tpc_begin
    self._tpc_cond.wait(30)
  File "/usr/lib/python2.6/threading.py", line 258, in wait
    _sleep(delay)

Or:

  File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py",
line 856, in setstate
    self._setstate(obj)
  File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py",
line 897, in _setstate
    p, serial = self._storage.load(obj._p_oid, '')
  File "/home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py",
line 824, in load
    self._load_lock.acquire()

I read through the code a bit and it sees that only one thread can commit
at a time.
So, my first guess is that the 'special' request is taking particularly
long to commit, which
ends up slowing everything else down. Would that be a fair estimation? What
might
cause a commit to take abnormally long? Just looking at the code I can't
see anything
that particularly stands out, especially since this bug only started
happening recently
and I haven't really changed that part of the code on the server-side
recently. I did
refactor the client code but I don't yet see how that might be causing this
issue.

One last thing. I have certain views which touch a lot of different parts
of the database and
do a lot of processing. I had many conflict errors in the past, and
retrying the entire view
all the time proved to be too inefficient, so I separated the code out into
blocks, where
I retry each block separately. This function describes the pattern well:

    def commit_and_inner_retry(transaction_manager, func, num_retries,
ingdescription)):
        """Using the `transaction_manager`:
        1) commit the transaction thus far
        2) try `num_retries` times to begin a new transaction, execute
`func`, which is
           passed the current transaction as its only argument, and then
commit the
           new transaction. if ConflictError is raised, abort & try again
        3) if a failure happens, then an error is printed involving
`ingdescription`
        4) raise ConflictError if retrying `num_retries` times does not
work.

        """
        tm = transaction_manager
        tm.commit()
        for retry in range(num_retries):
            tm.begin()
            return_value = func(tm.get())
            try:
                tm.commit()
            except ConflictError:
                print "Conflict error attempt #%d %s, trying again" %
(retry+1, ingdescription)
                tm.abort()
                continue
            break
        else:
            raise ConflictError("Was never able to commit %s" %
ingdescription)

        return return_value

With the view code looking something like this:

    def long_view(context, request):
        db_objs = prepare_stuff()

        def do_stuff_1():
            return stuff_1_with_db_objs(db_objs)
        stuff1 = commit_and_inner_retry(
            transaction.manager, block, 10, 'doing stuff 1')

        def do_stuff_2():
            return stuff_2_with_db_objs(stuff1)
        stuff2 = commit_and_inner_retry(
            transaction.manager, block, 10, 'doing stuff 2')

        def do_stuff_3():
            stuff_3_with_db_objs(stuff2)
        try:
            commit_and_inner_retry(
                transaction.manager, block, 10, 'doing stuff 3')
        except ConflictError:
            #this stuff not important
            transaction.abort()

        return response(stuff2)

It seems like this might cause more transactions than usual. Is it a
reasonable design pattern or
is there a better way of not having to re-do a lot of computation for a
conflict error that only
affects a small part of that computation?

Thanks,
- Claudiu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.zope.org/pipermail/zodb-dev/attachments/20130214/389341be/attachment-0001.html>