I've got a weird bug with my server and I'm wondering if anyone could provide some insight<div>into it.</div><div><br></div><div>The general idea of the server is that I have 8 or so clients constantly pinging it for information.</div>
<div>Each ping usually only takes up to 2 seconds to process.</div><div><br></div><div>Essentially, everything runs fine, until a certain code path is executed - basically, one of</div><div>the clients causes the server to try to give it some other information than usual. When this</div>
<div>client is turned on and starts pinging the server, suddenly all those requests that only took</div><div>0.5-2 seconds now take 10, 20, or even 30 seconds. </div><div><br></div><div>I'm using paster and using its thread-tracker tool I can see what all the threads are doing at </div>
<div>any given point. Usually there's anywhere from 0 to 4 requests going on, each having taken</div><div>less than 1 second up to that point. When the 'special client' is turned on, there's more like</div>
<div>10 requests each having taken 6 seconds or more up to that point. There aren't more</div><div>requests, it's just that they don't clear out quickly enough. The tail end of the tracebacks</div><div>are one of two things:</div>
<div><pre class="traceback"> File "/home/tsa/env/lib/python2.6/site-packages/transaction/_manager.py", line 89, in commit
return self.get().commit()
File "/home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py", line 329, in commit
self._commitResources()
File "/home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py", line 441, in _commitResources
rm.tpc_begin(self)
File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 547, in tpc_begin
self._normal_storage.tpc_begin(transaction)
File "/home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py", line 1118, in tpc_begin
self._tpc_cond.wait(30)
File "/usr/lib/python2.6/threading.py", line 258, in wait
_sleep(delay)</pre><pre class="traceback"><div style="font-family:arial;white-space:normal">Or:</div><div style="font-family:arial;white-space:normal"><pre class="traceback"> File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 856, in setstate
self._setstate(obj)
File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 897, in _setstate
p, serial = self._storage.load(obj._p_oid, '')
File "/home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py", line 824, in load
self._load_lock.acquire()</pre></div><div style="font-family:arial;white-space:normal">I read through the code a bit and it sees that only one thread can commit at a time.</div><div style="font-family:arial;white-space:normal">
So, my first guess is that the 'special' request is taking particularly long to commit, which</div><div style="font-family:arial;white-space:normal">ends up slowing everything else down. Would that be a fair estimation? What might</div>
<div style="font-family:arial;white-space:normal">cause a commit to take abnormally long? Just looking at the code I can't see anything</div><div style="font-family:arial;white-space:normal">that particularly stands out, especially since this bug only started happening recently</div>
<div style="font-family:arial;white-space:normal">and I haven't really changed that part of the code on the server-side recently. I did</div><div style="font-family:arial;white-space:normal">refactor the client code but I don't yet see how that might be causing this issue.</div>
<div style="font-family:arial;white-space:normal"><br></div><div style="font-family:arial;white-space:normal">One last thing. I have certain views which touch a lot of different parts of the database and</div><div style="font-family:arial;white-space:normal">
do a lot of processing. I had many conflict errors in the past, and retrying the entire view</div><div style="font-family:arial;white-space:normal">all the time proved to be too inefficient, so I separated the code out into blocks, where</div>
<div style="font-family:arial;white-space:normal">I retry each block separately. This function describes the pattern well:</div><div style="font-family:arial;white-space:normal"><br></div><div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> def commit_and_inner_retry(transaction_manager, func, num_retries, ingdescription</span></font><span style="white-space:normal;font-family:'courier new',monospace;font-size:x-small">)):</span></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> """Using the `transaction_manager`:</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> 1) commit the transaction thus far</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> 2) try `num_retries` times to begin a new transaction, execute `func`, which is</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> passed the current transaction as its only argument, and then commit the</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> new transaction. if ConflictError is raised, abort & try again</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> 3) if a failure happens, then an error is printed involving `ingdescription`</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> 4) raise ConflictError if retrying `num_retries` times does not work.</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> </span></font></div>
<div><span style="white-space:normal;font-family:'courier new',monospace;font-size:x-small"> """</span></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> tm = transaction_manager</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> tm.commit()</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> for retry in range(num_retries):</span></font></div>
<div><span style="white-space:normal;font-family:'courier new',monospace;font-size:x-small"> tm.begin()</span></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> return_value = func(tm.get())</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> try:</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> tm.commit()</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> except ConflictError:</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> print "Conflict error attempt #%d %s, trying again" % (retry+1, ingdescription)</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> tm.abort()</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> continue</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> break</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> else:</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> raise ConflictError("Was never able to commit %s" % ingdescription)</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> </span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> return return_value</span></font></div></div><div><font face="arial"><span style="white-space:normal"><br></span></font></div><div style="font-family:arial;white-space:normal">
With the view code looking something like this:</div><div style="font-family:arial;white-space:normal"><br></div><div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> def long_view(context, request):</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> db_objs = prepare_stuff()</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> </span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> def do_stuff_1():</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> return stuff_1_with_db_objs(db_objs)</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> stuff1 = commit_and_inner_retry(</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> transaction.manager, block, 10, 'doing stuff 1')</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> </span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> def do_stuff_2():</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> return stuff_2_with_db_objs(stuff1)</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> stuff2 = commit_and_inner_retry(</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> transaction.manager, block, 10, 'doing stuff 2')</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> </span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> def do_stuff_3():</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> stuff_3_with_db_objs(stuff2) </span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> try:</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> commit_and_inner_retry(</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> transaction.manager, block, 10, 'doing stuff 3')</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> except ConflictError:</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> #this stuff not important</span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> transaction.abort()</span></font></div>
<div><font size="1" face="courier new, monospace"><span style="white-space:normal"> </span></font></div><div><font size="1" face="courier new, monospace"><span style="white-space:normal"> return response(stuff2)</span></font></div>
</div><div style="font-family:arial;white-space:normal"><br></div><div style="font-family:arial;white-space:normal">It seems like this might cause more transactions than usual. Is it a reasonable design pattern or</div><div style="font-family:arial;white-space:normal">
is there a better way of not having to re-do a lot of computation for a conflict error that only </div><div style="font-family:arial;white-space:normal">affects a small part of that computation?</div><div style="font-family:arial;white-space:normal">
<br></div><div style="font-family:arial;white-space:normal">Thanks,</div><div style="font-family:arial;white-space:normal">- Claudiu</div><div style="font-family:arial;white-space:normal"><br></div><div style="font-family:arial;white-space:normal">
<br></div></pre></div>