[ZODB-Dev] Persistent ZEO Cache corruption?

Tim Peters tim at zope.com
Thu Jan 12 10:17:54 EST 2006


[Sidnei da Silva]
>> Every now and then I face a corruption of the persistent zeo cache, but
>> this is the first time I get this variant.

What other variants do you see?

>> The cause is very likely to be a forced shutdown of the box this zope
>> instance was running on, but I thought it would be nice to report the
>> issue.

Yes it is!  Thank you.  It would be better to open a bug report ;-).

>> Here's the traceback::
>>
>> File "/home/sidnei/src/zope/28five/lib/python/ZEO/ClientStorage.py", line
314, in __init__
>>   self._cache.open()
>> File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 112, in
open
>>    self.fc.scan(self.install) File
>> "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 835, in scan
>>    install(self.f, ent) File
>> "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 121, in
install
>>   o = Object.fromFile(f, ent.key, skip_data=True)
>> File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 630, in
fromFile
>>   raise ValueError("corrupted record, oid")
>> ValueError: corrupted record, oid
>>
>> I have a copy of the zeo cache file if anyone is interested.

Attaching a compressed copy to the bug report would be best (if it's too big
for that, or it's proprietary, let me know how to get it and I'll put it on
an internal ZC machine).  Can't tell in advance whether that will reveal
something useful, though (see below).

>> What is bad about this problem is that it prevented Zope from starting
>> and there is no obvious clue that removing the persistent zeo cache
>> would cure it, though that's what anyone that has a clue about what he's
>> doing would do *wink*.

[Jim Fulton]
> It sounds like there should be logic in that code to abandon the cache if
> a problem is found, much as we abandon file-storage index files if
> anything seems suspicious.

That's an excellent idea, and should be doable with finite effort.
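
A rough sketch of how that could look (the open_cache_or_discard helper
and the choice of exceptions to treat as "corrupt" are assumptions for
illustration, not the actual ClientStorage wiring):

    import logging
    import os

    def open_cache_or_discard(cache, cache_path):
        # Try to open the persistent cache; if it looks corrupt, discard
        # the file and start with a cold cache rather than refusing to
        # start up at all.
        try:
            cache.open()
        except (ValueError, EOFError) as err:
            logging.warning("discarding corrupt ZEO cache %s: %s",
                            cache_path, err)
            cache.close()          # assumed to release the file handle
            os.remove(cache_path)
            cache.open()           # assumed to recreate an empty cache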

> It seems as though persistent caches haven't been a very successful
> feature. Perhaps we should abandon them.

They do seem to be implicated in more than their share of problems, both
before and after MVCC.

The post-MVCC ZEO persistent cache _intends_ to call flush() after each file
change.  If it's missing one of those flush() calls, and depending on what
"forced shutdown" means exactly, that could be a systematic cause of
corruption.  It doesn't call fsync() unless it's explicitly closed cleanly,
but it's unclear what good fsync() actually does across platforms when
flush() is called routinely and the power stays on.

Those were intended to be reliability improvements over the pre-MVCC file
cache (which never called flush() or fsync()).
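
For illustration, that write discipline looks roughly like this (a
simplified stand-in, not the actual ZEO cache code):

    import os

    class CacheFileSketch:
        def __init__(self, path):
            self.f = open(path, "ab+")

        def write_record(self, data):
            # flush() after every change pushes the bytes to the OS, so
            # an interpreter crash right after this point shouldn't lose
            # the record; a power failure still can.
            self.f.write(data)
            self.f.flush()

        def close(self):
            # Only a clean close takes the extra step of fsync(), which
            # asks the OS to force its buffers onto the disk.
            self.f.flush()
            os.fsync(self.f.fileno())
            self.f.close()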

"kill -9" can do damage regardless, though.

Alas, if the cause is one of those, it's doubtful that analyzing the
corrupted file could prove it.

It's generally true that our file formats aren't designed to detect
corruption (e.g., we don't include checksums of any sort), so there's a
limit to how much corruption we can hope to catch.  The only post-MVCC ZEO
file cache
gimmick in this direction is storing 8 redundant bytes per record (the oid
for the record is stored near the start of the record, and again at the
end).  That's a lot better than nothing, and a mismatch in these redundant
oids is precisely what caused Sidnei's traceback.
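
Roughly, the idea is this (a simplified record layout for illustration,
not the exact on-disk format):

    import struct

    def write_record(f, oid, data):
        # 8-byte oid, 4-byte length, payload, then the oid again; a
        # truncated or scrambled record is likely to show a mismatch.
        f.write(oid + struct.pack(">I", len(data)) + data + oid)

    def read_record(f):
        oid = f.read(8)
        (size,) = struct.unpack(">I", f.read(4))
        data = f.read(size)
        if f.read(8) != oid:
            raise ValueError("corrupted record, oid")
        return oid, data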

The only other kinds of corruption routinely detected are weak checks:  we
hit EOF when trying to read a record's version, or when trying to read a
record's pickle.

There is one other strategy, but it only kicks in if you're not running
Python with -O (but should probably be run unconditionally at startup, and
changed to stop using "assert" statements):

        if __debug__:
            self._verify_filemap()

That checks that the info read from the cache file is mutually consistent in
various ways; Sidnei's run didn't get that far.
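
The same goes for the checks inside that routine:  a check written with
assert (the names here are made up for illustration), like

    assert recorded_size == actual_size, "cache file map size mismatch"

vanishes under -O, whereas an explicit

    if recorded_size != actual_size:
        raise ValueError("cache file map size mismatch")

keeps working, and raises something a caller could treat as "discard the
cache and start cold".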


