[ZODB-Dev] Corrupt db when trying to pack

Tim Peters tim at zope.com
Fri Oct 1 18:29:55 EDT 2004


[Gerry Kirk]
> I tried to pack a day old backup of our db, and ran into this:
>
> Exception Type:   CorruptedError
> Exception Value:  /home/local/Zope/jim/var/Data.fs:291529514:data
>                   record does not point to
>                   transaction header: 893605468050 != 252270482
>
> Traceback (innermost last):
>
>      * Module ZPublisher.Publish, line 101, in publish
>      * Module ZPublisher.mapply, line 88, in mapply
>      * Module ZPublisher.Publish, line 39, in call_object
>      * Module App.ApplicationManager, line 428, in manage_pack
>      * Module ZODB.DB, line 555, in pack
>      * Module ZODB.FileStorage, line 1570, in pack
>      * Module ZODB.fspack, line 700, in pack
>      * Module ZODB.fspack, line 455, in findReachable
>      * Module ZODB.fspack, line 485, in buildPackIndex
>      * Module ZODB.fspack, line 180, in checkData
>      * Module ZODB.fspack, line 161, in fail
>
> CorruptedError: /home/local/Zope/jim/var/Data.fs:291529514:data record
> does not point to transaction header: 893605468050 != 252270482
>
> After looking through other posts on the mailing list, the problem may
> have arisen because of some memory conflicts - we purchased some memory
> and found it conflicted with the existing memory.

You didn't say which version(s) of Zope or ZODB you're using, but there are
indeed no known software causes for FileStorage corruption in current
releases of Zope or ZODB.  You should look this over now:

    http://zope.org/Wikis/ZODB/FileStorageBackup 


> Anyhow, I tried running fsrecover.py, and that reduced the 700+ MB file
> by 24 MB. What worries me is what is in the 24 MB. The ZODB hasn't been
> packed in a few weeks.
>
> What else can I do? I ran fstest.py, and it stopped at this:
>
> Data.fs data record exceeds transaction record at 291529514: tloc
> 893605468050 != tpos 252270482

So that's the same message you saw above.  Since you did pause to read the
link above <wink>, I can "explain" what the message is saying.  There's a
data record (a pickle for one modified object) starting at file offset
291529514.  The transaction record that data record belongs to starts at
file offset 252270482.  We can tell that this was a "large" transaction,
because the data record here is almost 40MB (291529514 minus 252270482) beyond the
start of its transaction record.  You could probably find out more about
this transaction by running fsdump.py.
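For the record, the exact gap:

>>> 291529514 - 252270482
39259032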

The header for each data record contains the file offset to the transaction
record to which the data record belongs, and that's called "tloc" in the
message above.  It *should* equal tpos, 252270482, but it doesn't:  it
equals 893605468050.  So this .fs file is corrupted at a very low level.
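If you're curious, here's a minimal sketch of how to peek at that field by
hand.  It assumes the classic 42-byte data record header (oid, serial,
previous-record offset, and tloc as 8-byte big-endian fields, then a 2-byte
version length and an 8-byte data length); the layout has shifted a bit
across ZODB releases, so treat it as an illustration, not a tool:

    import struct

    # Assumed header layout -- check it against your ZODB release.
    DATA_HDR = ">8s8sQQHQ"        # oid, serial, prev, tloc, vlen, plen
    DATA_HDR_LEN = struct.calcsize(DATA_HDR)     # 42 bytes

    def read_tloc(path, pos):
        # Return the transaction-record offset (tloc) recorded in the
        # data record header that starts at file offset `pos`.
        with open(path, "rb") as f:
            f.seek(pos)
            hdr = f.read(DATA_HDR_LEN)
        oid, serial, prev, tloc, vlen, plen = struct.unpack(DATA_HDR, hdr)
        return tloc

    # For the file above, read_tloc("Data.fs", 291529514) reports the
    # bad value, 893605468050.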

Maybe we can get a clue about the nature of the damage by looking at tloc's
value in hex:

>>> hex(893605468050)
'0xD00F095792L'

Some luck after all:  tpos, 252270482, is 0x0F095792 in hex, so the bad
tloc is the correct offset with a single high-order byte changed from 0x00
to 0xD0 and everything else intact.  Damage confined to one byte smells
like flaky hardware, not misplaced data:  0xD0 isn't ASCII (the high bit is
set), and as a decimal number the result is heading close to a terabyte, so
it's not a valid file offset either.  A one-byte clue doesn't help with
recovery, though:  there's still no way to guess from here what might have
been in the 24MB you lost.
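Padding both numbers to 16 hex digits makes the damaged byte jump out:

>>> "%016x" % 893605468050
'000000d00f095792'
>>> "%016x" % 252270482
'000000000f095792'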

It's impossible to explain what fsrecover.py does at a semantic level (i.e.,
in terms of objects and their relationships), because fsrecover doesn't work
at that level.  There are a few kinds of redundancy in the *structure* of a
FileStorage (such as that data records each contain the offset to the
transaction record they belong to), and fsrecover works at that level:  when
one of these low-level redundancy checks fails, it simply
throws away bytes, "left to right", until it finds a region of bytes that
doesn't look insane according to these low-level checks.
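To make that concrete, here's a toy version of the scan.  This is not
fsrecover's actual code, and the sanity checks are simplified; the 23-byte
transaction header layout and status codes are the standard FileStorage
ones, assumed here rather than taken from fsrecover itself:

    import struct

    TRANS_HDR = ">8sQcHHH"    # tid, length, status, user/desc/ext lengths
    TRANS_HDR_LEN = struct.calcsize(TRANS_HDR)   # 23 bytes

    def next_plausible_txn(data, pos):
        # Slide forward one byte at a time until the bytes at `pos`
        # parse as a sane-looking transaction header.
        while pos + TRANS_HDR_LEN <= len(data):
            tid, tlen, status, ulen, dlen, elen = struct.unpack(
                TRANS_HDR, data[pos:pos + TRANS_HDR_LEN])
            # ' ' = committed, 'p' = packed, 'u' = undone; the length
            # must also keep the transaction inside the file.
            if status in b" pu" and 0 < tlen <= len(data) - pos:
                return pos    # looks sane; resume parsing here
            pos += 1          # throw this byte away and keep scanning
        return None           # ran off the end of the file

Everything between the point of damage and the position such a scan settles
on is what gets thrown away -- your missing 24MB.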

Running fsrefs.py *may* ease your mind:  if fsrefs has no complaints about
the file fsrecover created, then you're not missing the current revision of
any object.

OTOH, fsrefs may find more kinds of corruption than fstest found:  fstest
doesn't look inside object pickles, so they could be total trash (this is
highly unlikely -- but what you're seeing is also highly unlikely) and
fstest wouldn't know it.  fsrefs does look inside object pickles.
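If you want the flavor of that check, here's a simplified sketch.  It uses
today's import paths, and the _index attribute is an internal detail, so
this is illustration only -- the real script ships with ZODB as fsrefs.py:

    from ZODB.FileStorage import FileStorage
    from ZODB.serialize import referencesf
    from ZODB.POSException import POSKeyError

    fs = FileStorage("Data.fs", read_only=True)
    for oid in list(fs._index.keys()):   # oid -> offset of current record
        data, serial = fs.load(oid)      # blows up if the record is damaged
        for ref in referencesf(data):    # oids this pickle references
            try:
                fs.load(ref)
            except (POSKeyError, KeyError):
                print("missing referent", repr(ref), "from", repr(oid))
    fs.close()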

If you find you need that data, and you don't have a good backup, then your
options are unattractive:

- Become an expert in FileStorage forensic analysis.

- Hope that such an expert volunteers to do it for you.

- Pay such an expert to do it for you.

There's no guarantee of success in any of those cases either.

For the future, take the recommended practices on the page linked above to
heart.  For example, had you made daily backups all along, repairing
"impossible" corruption in the first 300MB of a 700MB FileStorage probably
would have been straightforward.


