[ZCM] [ZC] 2151/ 2 Comment "Random ValueError due to inconsistent ZEO persistent cache code"

Wed Dec 20 17:08:45 EST 2006

Issue #2151 Update (Comment) "Random ValueError due to inconsistent ZEO persistent cache code"
 Status Pending, Database/bug critical
To followup, visit:
  http://www.zope.org/Collectors/Zope/2151

==============================================================
= Comment - Entry #2 by ctheune on Dec 20, 2006 5:08 pm

Wow. This *is* a nasty error.

I spent some time trying to make sens out of this. I've tried to trace all possible callers and their interactions starting from settid().

I've ruled out a number of edge cases that might conflict here, but I didn't finally solve it.

I'm pretty sure that the tid passed to settid() must be strictly larger than the one before, but I can't prove it (yet). (My feeling
comes from the fact that most places that cause settid() to be called come from the invalidation code. This shouldn't cause
the same tid to be passed into the settid() twice.)

I do have the feeling that some race condition is around. The only
reason I can see for this is that another ZEO client might be
committing at the very same moment and the server's tpc_finish
gets called. This would trigger an invalidation from the server and calls invalidateTransaction.

I can imagine an unfortunate situation where the client would ask
for the current validations *and* get a new invalidation request
from the server at the same time. 

This could be ruled out, if the server can't send any requests
to the client until notifyConnected has finished.
________________________________________
= Request - Entry #1 by chrisw on Jul 17, 2006 7:51 am

Any server running an affected version of ZEO with a persistent client side cache may, at random intervals, die and start emitting tracebacks as below until the .zec files are removed and the app server is restarted:

Traceback (most recent call last):
  File "/usr/local/zope/2.9.3/lib/python/ZEO/zrpc/client.py", line 506, in notify_client
    self.client.notifyConnected(self.conn)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/ClientStorage.py", line 521, in notifyConnected
    self.verify_cache(stub)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/ClientStorage.py", line 589, in verify_cache
    self.invalidateTransaction(*pair)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/ClientStorage.py", line 1114, in invalidateTransaction
    self._cache.setLastTid(tid)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/cache.py", line 151, in setLastTid
    self.fc.settid(tid)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/cache.py", line 1060, in settid
    raise ValueError("new last tid (%s) must be greater than "
ValueError: new last tid (245115767599186653) must be greater than previous one (245115767599186653)

Okay, so why have I logged this as critical?
Well, because no-one seems to understand this code and it's got pretty wide ranging effects. The quick solution is do remove persistent client caches completely, but this would be a big shame as they can be a huge performance win in certain circumstances.

So, does ANYONE want to stump up and make a call on whether the < should be changed to <= or whether this is actualyl a real bgu caused by some other race condition?

Some Zope 2 related details:
http://mail.zope.org/pipermail/zodb-dev/2006-April/010031.html
http://mail.zope.org/pipermail/zodb-dev/2006-June/010157.html

A standalone ZODB example:
http://www.zope.org/Collectors/Zope/2038

A Zope 3 example:
http://mail.zope.org/pipermail/zope3-dev/2006-July/019697.html
==============================================================