[ZODB-Dev] A certain code path seems to be blocking everything

Fri Feb 15 20:16:23 UTC 2013

I figured out what was wrong and it was so convoluted I thought I'd share
it.

It ended up having nothing to do with zodb *per se*. I had a thread which,
whenever it processes a certain kind of internal message, loaded & saved a
~1.5mb pickled object from disk. The 'special client' ended up triggering
this thread twice/request. It was using pickle instead of cPickle, and I
think due to the GIL it was just thrashing everything else, though in
particular it seemed the zodb commit code was impacted particularly hard.

Using cPickle instead of pickle seems to have fixed the issue for now, but
I'm going to change that part of the code to be more efficient.

- Claudiu

On Thu, Feb 14, 2013 at 4:24 PM, Claudiu Saftoiu <csaftoiu at gmail.com> wrote:

> I've got a weird bug with my server and I'm wondering if anyone could
> provide some insight
> into it.
>
> The general idea of the server is that I have 8 or so clients constantly
> pinging it for information.
> Each ping usually only takes up to 2 seconds to process.
>
> Essentially, everything runs fine, until a certain code path is executed -
> basically, one of
> the clients causes the server to try to give it some other information
> than usual. When this
> client is turned on and starts pinging the server, suddenly all those
> requests that only took
> 0.5-2 seconds now take 10, 20, or even 30 seconds.
>
> I'm using paster and using its thread-tracker tool I can see what all the
> threads are doing at
> any given point. Usually there's anywhere from 0 to 4 requests going on,
> each having taken
> less than 1 second up to that point. When the 'special client' is turned
> on, there's more like
> 10 requests each having taken 6 seconds or more up to that point. There
> aren't more
> requests, it's just that they don't clear out quickly enough. The tail end
> of the tracebacks
> are one of two things:
>
>   File "/home/tsa/env/lib/python2.6/site-packages/transaction/_manager.py", line 89, in commit
>     return self.get().commit()
>   File "/home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py", line 329, in commit
>     self._commitResources()
>   File "/home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py", line 441, in _commitResources
>     rm.tpc_begin(self)
>   File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 547, in tpc_begin
>     self._normal_storage.tpc_begin(transaction)
>   File "/home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py", line 1118, in tpc_begin
>     self._tpc_cond.wait(30)
>   File "/usr/lib/python2.6/threading.py", line 258, in wait
>     _sleep(delay)
>
> Or:
>
>   File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 856, in setstate
>     self._setstate(obj)
>   File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 897, in _setstate
>     p, serial = self._storage.load(obj._p_oid, '')
>   File "/home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py", line 824, in load
>     self._load_lock.acquire()
>
> I read through the code a bit and it sees that only one thread can commit
> at a time.
> So, my first guess is that the 'special' request is taking particularly
> long to commit, which
> ends up slowing everything else down. Would that be a fair estimation?
> What might
> cause a commit to take abnormally long? Just looking at the code I can't
> see anything
> that particularly stands out, especially since this bug only started
> happening recently
> and I haven't really changed that part of the code on the server-side
> recently. I did
> refactor the client code but I don't yet see how that might be causing
> this issue.
>
> One last thing. I have certain views which touch a lot of different parts
> of the database and
> do a lot of processing. I had many conflict errors in the past, and
> retrying the entire view
> all the time proved to be too inefficient, so I separated the code out
> into blocks, where
> I retry each block separately. This function describes the pattern well:
>
>     def commit_and_inner_retry(transaction_manager, func, num_retries,
> ingdescription)):
>         """Using the `transaction_manager`:
>         1) commit the transaction thus far
>         2) try `num_retries` times to begin a new transaction, execute
> `func`, which is
>            passed the current transaction as its only argument, and then
> commit the
>            new transaction. if ConflictError is raised, abort & try again
>         3) if a failure happens, then an error is printed involving
> `ingdescription`
>         4) raise ConflictError if retrying `num_retries` times does not
> work.
>
>         """
>         tm = transaction_manager
>         tm.commit()
>         for retry in range(num_retries):
>             tm.begin()
>             return_value = func(tm.get())
>             try:
>                 tm.commit()
>             except ConflictError:
>                 print "Conflict error attempt #%d %s, trying again" %
> (retry+1, ingdescription)
>                 tm.abort()
>                 continue
>             break
>         else:
>             raise ConflictError("Was never able to commit %s" %
> ingdescription)
>
>         return return_value
>
> With the view code looking something like this:
>
>     def long_view(context, request):
>         db_objs = prepare_stuff()
>
>         def do_stuff_1():
>             return stuff_1_with_db_objs(db_objs)
>         stuff1 = commit_and_inner_retry(
>             transaction.manager, block, 10, 'doing stuff 1')
>
>         def do_stuff_2():
>             return stuff_2_with_db_objs(stuff1)
>         stuff2 = commit_and_inner_retry(
>             transaction.manager, block, 10, 'doing stuff 2')
>
>         def do_stuff_3():
>             stuff_3_with_db_objs(stuff2)
>         try:
>             commit_and_inner_retry(
>                 transaction.manager, block, 10, 'doing stuff 3')
>         except ConflictError:
>             #this stuff not important
>             transaction.abort()
>
>         return response(stuff2)
>
> It seems like this might cause more transactions than usual. Is it a
> reasonable design pattern or
> is there a better way of not having to re-do a lot of computation for a
> conflict error that only
> affects a small part of that computation?
>
> Thanks,
> - Claudiu
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.zope.org/pipermail/zodb-dev/attachments/20130215/021bed8e/attachment.html>