[ZODB-Dev] Preliminary notes from fixing a bad data.fs

Tim Peters tim at zope.com
Mon Jan 10 18:12:52 EST 2005


[ethan fremen]
>> ...
>>> ConflictError: database conflict error (oid 0x05, serial this txn
>>> started with 0x00 1900-01-01 00:00:00.000000, serial currently
>>> committed 0x0339fe733cf5ddf7 2001-01-20 21:39:14.287598)

[Tim Peters]
>> Woo hoo!  Congratulations again -- for the past couple days we've been
>> fruitlessly speculating here about how we could possibly get a conflict
>> error that claimed the "before" serial was 0.  There seem to be a couple
>> reports of that per year, but doesn't look like anyone ever got anywhere
>> with it.

[Dieter Maurer]
> If the "Data.fs" is indeed seriously (!) corrupted, a zero serial might
> be read from it -- when the serial stored with an object state happens to
> be (wrongfully) zero.

That's so, but I doubt it applies here.  Since you doubt it too <wink>,
let's let that drop.
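
For anyone wondering where the 1900 date in the traceback comes from: an
all-zero serial is just the zero TimeStamp, and str() of a TimeStamp is
exactly what the ConflictError message prints.  A throwaway illustration --
note I'm assuming the persistent.TimeStamp import spelling, since the class
has lived in different places over the years:

    from persistent.TimeStamp import TimeStamp

    # The all-zero serial from the traceback above.
    print(str(TimeStamp('\x00' * 8)))    # 1900-01-01 00:00:00.000000

    # The "currently committed" serial from the same traceback.
    print(str(TimeStamp('\x03\x39\xfe\x73\x3c\xf5\xdd\xf7')))
    # -> 2001-01-20 21:39:14.287598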

> I do not think this is probable, though. And I seem to remember that some
> FileStorage tools make a consistency check (the serial stored with the
> object must equal the transaction id).

I wish.  Although "the docs" (a comment at the top of FileStorage.py) say:

# A data record consists of
#
...

#   - 8-byte serial, which is a type stamp that matches the
#     transaction timestamp.

that isn't true (apart from just s/type stamp/timestamp/).  I'm not sure
what the full truth is.  The only exception I know about is noted in a
comment later on, inside _commitVersion():

            # If we are aborting, the serialno in the new data
            # record should be the same as the serialno in the last
            # non-version data record.
            # XXX This might be the only time that the serialno
            # of a data record does not match the transaction id.

This is why everyone except you hates versions <wink -- but they do grossly
complicate code all over the place>.
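
Just to make the invariant concrete, here's a rough sketch of the kind of
consistency check you're remembering -- my sketch, not anything that ships
with FileStorage.  It leans on the FileIterator interface, assumes the data
records it yields expose .oid and .serial (the attribute spelling has wobbled
across ZODB versions), and ignores the version-abort exception noted above:

    from ZODB.FileStorage import FileIterator

    def check_serials(path):
        # Walk every transaction in the .fs file and complain whenever a
        # data record's stored serial doesn't equal the tid of the
        # transaction it belongs to.
        for txn in FileIterator(path):
            for record in txn:
                if record.serial != txn.tid:
                    print("oid %r: serial %r != tid %r" %
                          (record.oid, record.serial, txn.tid))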

...
> Usually, reading the index makes a heuristic check whether the index
> matches the storage. I had the impression that the check is quite
> reliable -- but a heuristic may err ;-)

I never looked at that code before -- I believe you're talking about
FileStorage._sane().  If so, it's not doing what it says it's doing ("Sanity
check saved index data by reading the last undone trans"), but is checking
some of the data records in the last transactions that have *not* been
undone.  It checks to see that the "administrative overhead" (glue) fields
in the .fs file look reasonable in the last transactions ... and verifies
that the .index file's ideas of the offsets to the current revisions of some
of the last data objects committed match their actual file offsets.
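
To make the offset check concrete, here's a stripped-down sketch of the idea
(not the actual _sane() code): the .index maps an oid to the absolute file
position of its current data record, and a data record starts with the
8-byte oid, so a cheap check is to seek to the recorded position and see
whether the oid stored there is the one the index claims:

    def offset_looks_sane(fsfile, oid, index_pos):
        # fsfile is the open .fs file; index_pos is where the .index file
        # claims oid's current data record starts.
        fsfile.seek(index_pos)
        return fsfile.read(8) == oid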

> However, the probability that the changes made to the storage by packing
> compromise the match with the index without the heuristics detecting it
> seems very low to me.

Alas, I think the chance is 0 on the nose -- unless the pack didn't remove
anything from the file.  Else the file would have "shifted left" at least
one byte during packing, and the post-pack offsets of the last data objects
committed would no longer match their pre-pack offsets.

OTOH, the code checks the offsets of up to 5 objects, and I'm not clear on
why (it seems that checking one object's offset would be sufficient).  So
maybe there's another vulnerability I'm just not seeing.

> Moreover, packing stores a new index file (determined during packing).

Well, it tries to.  The layers of exception-suppression in that code mean we
may never know it if it failed to write a new index file.

>> ....
>> Speaking of the dangers of silent data loss, here's a bit of the code
>> that tries to write an .index file:
>>
>>        try:
>>            try:
>>                os.remove(index_name)
>>            except OSError:
>>                pass
>>            os.rename(tmp_name, index_name)
>>        except: pass
>>
>> So if anything at all goes wrong in this part (insufficient permission,
>> bugs in the code that leave index_name or tmp_name wrong, ...), you
>> won't hear anything about it.

> I hit precisely this code when I recently tried to understand why our
> index files were not written reliably -- and I added at least logging for
> problems (in our Zope sources).

So do you think these broad "except" clauses serve a purpose?  I really wish
people would write comments when they're suppressing exceptions, but too
late here.  I can guess that maybe the innermost try/except is trying to
guard against the case where "the old" index file doesn't currently exist
(although "if os.path.exists(index_name)" is a much better way to ask about
that).  I've got no guess for why we'd want to hide that a renaming
operation failed.
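
For what it's worth, here's roughly what I'd rather see there -- the
os.path.exists test plus your logging idea.  Just a sketch with a made-up
helper name, not a patch:

    import logging
    import os

    logger = logging.getLogger("ZODB.FileStorage")

    def install_index(tmp_name, index_name):
        # Move the freshly written temporary index into place, and log
        # (instead of swallowing) anything that goes wrong.
        try:
            if os.path.exists(index_name):
                os.remove(index_name)
            os.rename(tmp_name, index_name)
        except OSError:
            logger.exception("failed to install new index file %r", index_name)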

> It turned out that the problem was not there but in "zdaemon". When
> "zdaemon" shuts down a service, it sends it a SIGTERM, waits up to
> "backoff-limit" seconds (10 s by default) for the service to stop, and
> sends it a SIGKILL if it does not. Thus, when writing the index(es) does
> not finish within 10 s, the new index is not installed -- without anyone
> learning about it...
>
> I have filed a feature request, with a solution, in the collector to give
> "zdaemon" a separate configuration option, "shutdown-timeout", to control
> shutdown behaviour. Hopefully it finds its way into the official Zope
> sources ;-)

Hey, sign a contributor agreement, and I bet you can have complete control
over absolutely everything in zdaemon <0.5 wink>.


