[ZODB-Dev] RE: Re: "Time travel" conflict errors in Zope

Mon Dec 20 13:44:51 EST 2004

[Malcolm Cleaton]
> ...
> So, the question becomes, why was the version of the object committed at
> 14:53 being repeatedly loaded into transactions begun after 16:00, when a
> later version was available? Given the known issue with the hard drive
> filling up, it's starting to sound like this is a symptom of a broken ZEO
> client cache file...?

That sounds *most* plausible to me, but I really don't know.  If the ZEO
client cache held on to the 14:53 revision even though a more recent
revision was available on the ZEO server, then every time the client tried
to modify this object:

1. The client would start with the 14:53 version, from its ZEO client
   cache.

2. The ZEO server would raise ConflictError when the client tried to
   commit, because the ZEO server had a committed version more recent
   than 14:53.

AFAICT then, so long as no other ZEO client (if any) was able to commit a
change to the object either, the ZEO client with the stale revision wouldn't
get an invalidation message for the object, and 1+2 could repeat forever
after.
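
To make the 1+2 loop concrete, here is a toy model of it.  This is only a
sketch: ToyServer, ToyClient and count_conflicts are made-up names, nothing
here is ZEO code, and the ConflictError class is a local stand-in rather
than ZODB's.  It assumes a single object whose revisions are just a counter.

```python
class ConflictError(Exception):
    """Local stand-in for ZODB's ConflictError (not the real class)."""


class ToyServer:
    """Hypothetical server holding one object and its revision counter."""

    def __init__(self):
        self.serial = 0
        self.value = None

    def commit(self, base_serial, new_value):
        # Refuse a write that was based on an out-of-date revision.
        if base_serial != self.serial:
            raise ConflictError(
                f"based on revision {base_serial}, current is {self.serial}")
        self.serial += 1
        self.value = new_value
        return self.serial


class ToyClient:
    """Hypothetical client that always starts from its cached revision."""

    def __init__(self, server):
        self.server = server
        self.cached_serial = server.serial
        self.cached_value = server.value

    def invalidate(self):
        # What a delivered invalidation would lead to: reload current state.
        self.cached_serial = self.server.serial
        self.cached_value = self.server.value

    def try_modify(self, new_value):
        # Step 1: start from the cached revision.  Step 2: server checks it.
        serial = self.server.commit(self.cached_serial, new_value)
        self.cached_serial, self.cached_value = serial, new_value


def count_conflicts(client, attempts):
    """Try `attempts` modifications and count how many raise ConflictError."""
    conflicts = 0
    for i in range(attempts):
        try:
            client.try_modify(f"edit {i}")
        except ConflictError:
            conflicts += 1
    return conflicts
```

If a second ToyClient commits once and the first client never has
invalidate() called, every later count_conflicts() attempt by the first
client fails, because a failed commit never advances the server's revision
and so nothing ever prompts a reload.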

I don't know how the ZEO client cache could get into this state, but I have
no experience with pathological behaviors (aka "bugs" <wink>) in 3.2's ZEO
client cache code, so the fact that I don't know exactly how it could happen
doesn't really mean anything.  I'll take this as evidence that it can happen.
Holding on to stale objects is bad, bad, bad.

...

> My current theory, in the light of your explanations, is that the
> disk-full event caused the zeo client cache to miss some changes, and
> that everything else we saw is the direct result of this.

That's my theory too <wink>, but again I don't have a precise explanation in
hand (which means I don't know what could be going wrong exactly, so don't
have a clue how to fix it yet either).

> I would guess that, for efficiency reasons, the cache validation
> procedure doesn't think too hard about the possibility of object changes
> in the past that have already been missed, 

Cache verification *intends* to be bulletproof.  If it isn't, it's a bug.
It can take efficiency shortcuts when it believes they're 100% safe, but
will fall back to exhaustive verification if it believes that's necessary.
I suppose it's possible that out-of-disk-space caused a kind of corruption
in the on-disk cache file that the code doesn't detect (and possibly could
not detect -- cache files aren't laced with redundancy or checksums),
violating some invariant the verification dance implicitly relies on.  I
just don't know.
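
For illustration only, this is the kind of redundancy I mean.  It is NOT
the ZEO cache file format: the record layout, pack_record and unpack_record
are all invented for this sketch, which just shows how a per-record CRC
would let a reader notice a truncated or scribbled-on record.

```python
import struct
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct(">II")


def pack_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC32."""
    return HEADER.pack(len(payload), zlib.crc32(payload) & 0xFFFFFFFF) + payload


def unpack_record(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the record looks damaged."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    length, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        # A write cut short (e.g. by a full disk) would show up here.
        raise ValueError("truncated payload")
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("checksum mismatch")
    return payload
```

A reader built this way can at least refuse a damaged record instead of
trusting it, at the cost of 8 extra bytes and one CRC pass per record.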

> and that therefore the only way out of such situations is to delete the
> cache files.

When the cache is busted, that conclusion applies regardless of cause -- a
busted cache is unlikely to fix itself by magic <wink>.

> I'd guess further that truncating the ZODB was entirely the wrong thing
> to do, and that deleting the cache files alone would have been
> sufficient, but it does now make sense why rolling the ZODB back to
> before the cache went wrong had some effect.

Ya, there's really no evidence that anything was hosed on the ZEO server
(where the .fs file lives).  After a system failure, it's prudent to run
fstest and fsrecover, but if they don't complain there's really no point to
mucking with the database file.

I personally wouldn't trust any cache file after any system failure, whether
in the context of ZEO or anything else.  But that's just me, not an official
Zope Corp position.  Since you've got strong reason to believe that your ZEO
client cache was in fact messed up after a system failure, if I were you I'd
certainly blow away client cache files after something abnormal happened.
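
A minimal sketch of "blowing away" the cache files, assuming they carry the
.zec extension seen in your log line and all live in one directory.  The
function name and the cache_dir argument are hypothetical; where the files
live depends on your ZEO client configuration.  Run something like this
only while the ZEO client is shut down.

```python
import glob
import os


def remove_client_caches(cache_dir, pattern="*.zec"):
    """Delete files matching `pattern` in `cache_dir`; return removed paths."""
    removed = []
    for path in sorted(glob.glob(os.path.join(cache_dir, pattern))):
        os.remove(path)
        removed.append(path)
    return removed
```

The client should then start with an empty cache and refill it from the
server, so the cost is slower loads for a while rather than lost data.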

Ah ... over on zope-dev, I see you gave more info, such as that:

> Right after flipping cache files, Zope began to log a flood of errors
> such as:
>
> (our cache file).zec invalidate: oid mismatch: expected 0x0edcd6 read
> '(data)'

That certainly says the cache files got corrupted.  That narrows it down a
teensy bit -- and perhaps this problem can only occur in the presence of
cache file corruption.