[ZODB-Dev] return of corruption errors

Tim Peters tim at zope.com
Tue Oct 19 14:56:39 EDT 2004


[Gerry Kirk]
> Just when I thought all was fine again, running my nightly fstest.py and
> fsrefs.py is turning up errors once more on a backup copy of the db. I'm
> using the standard 'cp' to copy the db. Version of ZODB is 3.2.3.

Have you read the background info here:

    http://zope.org/Wikis/ZODB/FileStorageBackup

?  There's a section on FileStorage corruption at the bottom.  It won't
cheer you up, although you should read it anyway.  If you had said you were
experiencing POSKeyErrors, I would have suggested trying 3.2.4c1 (dang -- I
really need to release 3.2.4c2 ...), but there are no known ways for 3.2.3
ZODB/ZEO to corrupt an .fs file.  In fact, I know of no such thing in any
version of ZODB since I started at Zope, about 4 years ago.  Every resolved
case of .fs corruption I know of was traced to HW, firmware, or system
software bugs.  That leaves some .fs corruption problems whose resolutions
(if any) I never heard about; in some of the latter cases I know the
problems "just went away" after moving to a new box, so presumption has to
favor the HW/firmware/system-bug hypothesis.
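One cheap way to gather evidence for or against that hypothesis is to verify each backup copy right after making it. The sketch below is only an illustration: the function names are made up, it assumes the source file only grows by appending while the copy runs (no pack in progress), and a match served from the OS cache does not prove the bytes on disk are good.

```python
import hashlib
import os
import shutil

CHUNK = 1 << 20  # read 1 MiB at a time


def digest_prefix(path, length):
    """Return the SHA-256 hex digest of the first `length` bytes of `path`."""
    h = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            block = f.read(min(CHUNK, remaining))
            if not block:
                break  # file is shorter than expected
            h.update(block)
            remaining -= len(block)
    return h.hexdigest()


def copy_and_verify(src, dst):
    """Copy `src` to `dst`, then compare digests of the common prefix."""
    size = os.path.getsize(src)  # size before the copy starts
    shutil.copyfile(src, dst)
    # Only the first `size` bytes are compared, so data appended to the
    # source during the copy does not cause a false alarm.
    return digest_prefix(src, size) == digest_prefix(dst, size)
```

A mismatch on some nights but not others, with no change to the software, would point at the copy path (disk, controller, memory, filesystem) rather than at ZODB.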

> The strange thing is, I don't get errors every night.

That's not strange if it *is* a HW (etc) bug.  It would be strange if it
were a deterministic error in the ZODB/ZEO code.

> Over the weekend, I had a prob one night, and next night no probs
> reported. Then, again last night:
>
>  From fstest.py -
>
> /backup/local/usr/local/Zope/Zeo/var/Data.fs data record exceeds
> transaction record at 607409305: tloc 1681098631 != tpos 607356807
>
>  From fsrefs.py -
> Traceback (most recent call last):
>    File "fsrefs.py", line 107, in ?
>      main(path)
>    File "fsrefs.py", line 65, in main
>      fs = FileStorage(path, read_only=1)
>    File "/usr/local/Zope/SoftwareHome/lib/python/ZODB/FileStorage.py",
> line 289, in __init__
>      read_only=read_only,
>    File "/usr/local/Zope/SoftwareHome/lib/python/ZODB/FileStorage.py",
> line 1928, in read_index
>      name, pos)
>    File "/usr/local/Zope/SoftwareHome/lib/python/ZODB/FileStorage.py",
> line 179, in panic
>      raise CorruptedTransactionError(message)
> ZODB.FileStorage.CorruptedTransactionError:
> /backup/local/usr/local/Zope/Zeo/var/Data.fs data record exceeds
> transaction record at 607409305

The page above explains those tools in more detail than you can get
elsewhere.  fsrefs is a higher-level tool than fstest, and if the latter
fails there's not much point to trying fsrefs.  fstest doesn't look at the
.index file, though.  All I can deduce from the above is what you already
know:  the .fs file is corrupt.
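For a nightly job, that ordering can be encoded directly: run the lowest-level check first and skip the rest once something fails. This is a minimal sketch, assuming each tool takes the .fs path as its argument and exits with a nonzero status on failure; the interpreter name, script locations and database path are placeholders.

```python
import subprocess


def run_checks(commands):
    """Run each command in order; stop at the first one that fails.

    Returns the list of (command, returncode) pairs actually run.
    """
    results = []
    for cmd in commands:
        proc = subprocess.run(cmd)
        results.append((cmd, proc.returncode))
        if proc.returncode != 0:
            break  # a lower-level failure makes later checks pointless
    return results


if __name__ == "__main__":
    db = "/path/to/backup/Data.fs"  # hypothetical path
    checks = [
        ["python", "fstest.py", db],  # script locations are assumptions
        ["python", "fsrefs.py", db],
    ]
    for cmd, rc in run_checks(checks):
        print(" ".join(cmd), "->", "ok" if rc == 0 else "FAILED (%d)" % rc)
```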

> I tried running fsrecover.py which trimmed Data.fs from 673677466 to
> 673305967. Running fstest.py and fsrefs.py on that produced errors as
> well.

I'm surprised that fstest still complained.  The page above lacks text for
fsrecover.py, because I'm at a loss to explain what it does in a
comprehensible way, or to characterize the conditions under which it may be
useful.  fsrecover marches over the file once, and throws bytes away
unmercifully whenever it detects corruption of the kind fstest complains
about.  So fstest and fsrecover work at the same level, details of the
lowest-level "glue bytes" holding an .fs file together.  For that reason,
while fsrecover may leave you with an .fs file that fsrefs still dislikes,
fstest shouldn't find any more to complain about.  fsrecover will throw the
entire file away, if that's what it takes to get rid of "glue" damage --
there's nothing nuanced about fsrecover, it's brute force all the way.

> Looking at the first error, it looks like maybe the index file is not in
> synch with the db, or just has bad data, as the db has never grown to
> 1.6GB.

The first error is from fstest, and fstest doesn't even open the .index
file.  In fact, fstest is *so* low-level it doesn't import anything beyond a
few standard Python modules.  In contrast, fsrefs imports a ton of
FileStorage implementation code.
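To give a feel for how low-level that is, here is a sketch of the kind of check behind a "tloc != tpos" complaint, using nothing but the struct module. The header layout is an assumption about the FileStorage format, not something stated in this message (big-endian: 8-byte oid, 8-byte serial, 8-byte previous-record position, 8-byte transaction position, 2-byte version length, 8-byte pickle length). The names are illustrative and this is not the actual fstest code.

```python
import struct

# Assumed layout of a data record header (42 bytes, big-endian):
# oid, serial, previous-record position, transaction position (tloc),
# version length, pickle length.
DREC_HDR = struct.Struct(">8s8sQQHQ")


class FormatError(ValueError):
    """Raised when the low-level record framing is inconsistent."""


def check_drec_header(buf, pos, tpos):
    """Check one data record header found at file offset `pos`.

    `tpos` is the offset of the transaction record that contains it.
    Returns the pickle length on success.
    """
    if len(buf) < DREC_HDR.size:
        raise FormatError("truncated data record header at %d" % pos)
    oid, serial, prev, tloc, vlen, plen = DREC_HDR.unpack_from(buf)
    # Every data record points back at its enclosing transaction record.
    if tloc != tpos:
        raise FormatError(
            "data record exceeds transaction record at %d: tloc %d != tpos %d"
            % (pos, tloc, tpos)
        )
    return plen
```

Under this reading, a tloc far larger than the file ever was simply means those 8 bytes hold garbage. No index file is involved.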

> Is it possible that the cache files are causing some of these problems? I
> posted earlier that I was having BadPickleGet and 502 response codes
> until i deleted those files.

Anything's possible here.  BadPickleGet is also a symptom of file (or
memory) corruption, but inside object pickles.  You can replace an object
pickle with any string of nonsense bytes whatsoever, and that can't hurt
anything fstest looks at (fstest knows nothing about application objects --
fstest doesn't even import the pickle module).  So you're seeing corruption
both inside pickles, and in the FileStorage "glue" that holds pickles
together.  To me, that strongly suggests a deeper HW/firmware/network/system
bug of some kind.  If possible, try running your app on a different box?
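The split between the two kinds of damage can be shown with a toy model. This is a sketch, not FileStorage: the "glue" here is just an 8-byte length prefix invented for the illustration, and the payload is an ordinary pickle.

```python
import pickle
import struct


def frame(payload):
    """Prefix a payload with its 8-byte big-endian length (toy 'glue')."""
    return struct.pack(">Q", len(payload)) + payload


def glue_ok(record):
    """Low-level check: the length prefix matches the payload size."""
    (length,) = struct.unpack_from(">Q", record)
    return length == len(record) - 8


def pickle_ok(record):
    """Higher-level check: the payload actually unpickles."""
    try:
        pickle.loads(record[8:])
        return True
    except Exception:  # corrupt pickles can fail in several ways
        return False


good = frame(pickle.dumps({"title": "example"}))
# Same length, nonsense contents: the glue is intact, the pickle is not.
bad = frame(b"\x00" * (len(good) - 8))
```

A glue-only checker accepts both records. Only code that loads the pickle notices that the second one is junk.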

> So, I'm finding it hard to understand what is going on, since sometimes i
> have probs and sometimes it appears i don't, without any intervention.
> When I try to fix the prob with fsrecover, it still reports a prob.
> (sigh)

There's nothing fsrecover can do to repair corrupt object pickles, so it's
expected that fsrefs may still complain after fsrecover has done its best.
It's not expected that fstest will complain after fsrecover, though (well,
it's not expected by me <wink>).


